STATISTICAL METHODS FOR RESERVOIR WATER QUALITY INVESTIGATIONS

PART  I:  INTRODUCTION 

Background 

1.  Through  its  Civil  Works  Program,  the  US  Army  Corps  of  Engineers 
(CE)  is  responsible  for  the  operation  of  a  large  number  of  water  resources 
projects.  Over  400  of  these  projects  are  reservoirs  which  are  operated 
for  a  number  of  purposes,  including  flood  control,  hydropower,  recreation, 
navigation,  and  water  supply.  In  addition,  the  CE  is  required  by  Federal 
legislation  in  the  operation  of  its  reservoirs  to  comply  with  Federal  and 
state  water  quality  requirements. 

2.  To  provide  policy  and  guidance  in  addressing  Federal  and  state 
water quality legislative requirements, the CE has issued Engineer
Regulations (ER) on the collection and interpretation of water quality data.
Specifically: 

a.  ER 1110-2-334, "Reporting Water Quality Management Activities,"
established the consideration of water quality as
an integral feature of CE responsibilities and set out the
requirements for monitoring programs and the reporting of
water quality data collected at CE reservoirs.

b.  ER 1110-2-415, "Water Quality Data Collection, Interpretation,
and Application," established guidelines for water
quality monitoring programs and data interpretation at CE
projects.

While  these  ERs  establish  the  general  guidelines  and  requirements  for 
assessing  water  quality,  they  do  not  provide  specific  assistance  to  the 
CE  Division  and  District  offices  in  developing  water  quality  programs, 
including  data  analysis  and  interpretation. 

3.  To  address  the  national  environmental  water  quality  objectives 
delineated  in  Federal  legislation,  in  1978  the  Office,  Chief  of  Engineers, 
instituted  a  major  research  effort,  the  Environmental  and  Water  Quality 
Operational Studies (EWQOS). The EWQOS Program has addressed a number
of environmental quality issues and provided guidance for the design
and operation of CE projects with respect to maintaining or enhancing
environmental quality in a manner that is compatible with project
purpose.

4.  One  of  the  major  efforts  of  the  EWQOS  Program,  Reservoir  Field 
Studies  (Work  Unit  VIIA),  was  initiated  to  develop  various  operational, 
control,  and  management  procedures  to  address  environmental  and  water 
quality  problems  at  CE  reservoirs.  One  specific  objective  of  this 
research  program  was  to  provide  guidance  on  the  design  of  District  and 
Division  reservoir  water  quality  sampling  programs.  This  report  is 
intended to provide guidance to field personnel in the data analysis and
interpretation  phase  of  a  water  quality  monitoring  program. 

Purpose  and  Scope 

5.  The  purpose  of  this  report  is  to  provide  to  CE  Division  and 
District  personnel  a  general  introduction  to  the  statistical  analysis  of 
water  quality  data.  The  major  and  most  common  concepts  and  techniques 
involved  in  the  statistical  interpretation  of  water  quality  data  will  be 
discussed.  It  will  not  be  possible  to  provide  a  comprehensive  or 
thorough  treatment  of  all  of  the  statistical  methods  that  can  be  applied 
to water quality data. This report is not intended to replace statistical
textbooks; rather, it provides the necessary background for more
effective and efficient use of those reference works. Also, this report
does  not  provide  specific  guidance  on  the  statistical  techniques  to  be 
used  for  data  interpretation.  The  application  of  specific  statistical 
methods  will  be  dependent  on  and  constrained  by  site-specific  features 
and  the  data  collected  in  a  water  quality  monitoring  program. 

6.  This  report  is  intended  for  use  by  all  personnel  involved  in 
the  design,  implementation,  and  data  interpretation  of  water  quality 
monitoring  programs  in  CE  reservoirs.  Most  of  the  information  contained 
in this report is discussed in a more detailed manner in other sources.
A number of introductory statistics textbooks are identified in the
bibliography, and it is suggested that at least one of these be on hand as
this report is read. These textbooks also contain mathematical and
statistical tables that will be required for the implementation of the
techniques presented in this report.

7.  The presentation of the material in this report is divided
into several parts. Parts II and III discuss the use of data displays
and descriptive statistics, both of which are effective means of
summarizing water quality data. The application of inferential statistics
is presented in Part IV. Inferential statistics are required to draw
sound conclusions about differences, relationships, or trends within the
data. Part V presents a brief introduction to the statistical concerns
involved in sampling program design that are necessary for the proper
execution of a water quality monitoring program. A glossary of
statistical terms is provided as Appendix A.

PART II: DATA DISPLAYS


Introduction 


8.  It is good practice in statistical analysis to begin with a
study involving graphical displays of the data. That is, before
descriptive statistics are calculated from a data set, and before
analyses such as correlation, analysis of variance, or linear regression are
performed, it is wise to look at various displays of the raw data. The
graphs recommended for this task are useful in identifying the need to
edit or transform the data prior to conducting the statistical analysis.
Most procedures in statistics (e.g., regression analysis, hypothesis
testing) derive summary values (e.g., mean and standard deviation) from
a data set. Thus, if the inferences drawn from the statistical
procedures are to be valid for the entire data set, it is important that the
summary statistics represent the entire data set. The graphical
displays help guide the choice of any necessary manipulations of the data
set and the selection of appropriate statistics (see Part III: Basic
Descriptive Statistics) to summarize the data. Examples presented in
the following sections should underscore the importance of examining
data displays at the beginning of a statistical study.

9.  Graphs can also be useful during the course of a statistical
study. For example, bivariate scatter plots are helpful in the
selection of independent variables for a regression equation. Upon
completion of the statistical analysis, the scientist often wisely chooses to
present some of the results in graphical form. Often,
conclusions are most effectively conveyed in a graphical display.

Histograms 

10.  In the most fundamental study, data on a single
characteristic are analyzed. For example, the limnologist has data on
chlorophyll a from a particular reservoir on a particular date and desires to
summarize the information obtained. The limnologist could calculate the
mean and standard deviation of the sample data set; alternatively, he
could calculate other statistics representing central tendency and
dispersion (see Part III). Prior to calculating any statistics for the
sample, however, the limnologist should first look at a plot of the
data. For data representing a single characteristic (such as
chlorophyll a), the histogram is often a useful graphical display.

11.  As an example, assume that the summer chlorophyll a data in
Table 1 have just been acquired, and the limnologist would like a
"picture" of this sample. As a first cut, the histogram in Figure 1 is
plotted. To construct the histogram, the limnologist must first divide
the range (lowest value to highest value) into equal-sized intervals.
In Figure 1, the range is approximated by 10 to 160 (actually it is 17.7
to 150, but the approximation is good enough) and is divided into
intervals of 10 units (micrograms per litre). For each interval, 10 to 20,
30 to 40, and so on, simply count the number of data points that lie in
the interval and construct vertical bars with height proportional to
that number. So, for example, there are three observations in the
40 to 50 range and six observations in the 60 to 70 range. Thus, the
bar for the 60 to 70 interval is twice the height of the 40 to 50 bar.
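
The counting step described above can be sketched in a few lines of Python. This is only an illustration: the values in `chl` are hypothetical stand-ins (Table 1 is not reproduced here), and the function name `histogram_counts` is introduced for this example.

```python
def histogram_counts(values, low, high, width):
    """Count observations in equal-width intervals covering low to high."""
    n_bins = int(round((high - low) / width))
    counts = [0] * n_bins
    for v in values:
        if not (low <= v <= high):
            raise ValueError(f"{v} lies outside the plotting range")
        # An observation equal to the upper limit goes in the last interval.
        idx = min(int((v - low) // width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical chlorophyll a values (ug/L); NOT the Table 1 data.
chl = [17.7, 33.0, 41.2, 44.4, 48.9, 50.5, 52.1, 53.8, 55.6, 56.6, 57.9,
       58.3, 61.0, 62.2, 63.5, 64.2, 66.8, 69.9, 72.4, 131.0, 150.0]

counts = histogram_counts(chl, low=10, high=160, width=10)

# Print a text histogram: one row per interval, one star per observation.
for i, c in enumerate(counts):
    lower = 10 + 10 * i
    print(f"{lower:3d} to {lower + 10:3d} | " + "*" * c)
```

In this made-up sample, as in the text, the 60 to 70 interval holds twice as many observations as the 40 to 50 interval, so its bar is twice as tall.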

12.  What does the histogram tell us about the sample? Basically,
it provides us with a visual image of the distribution of the sample.
In specific terms, this means that we are able to quickly see such
things as location of the "center" of the sample, amount of
"dispersion," extent of "symmetry," and existence of "outliers" in the sample.
In Figure 1, the center is clearly identified by the peak in the
50 to 60 interval, and dispersion could perhaps be characterized by
stating that about 85 percent of the observations lie between 30 µg/L
and 80 µg/L. The histogram is not symmetric, however, and one might
want to check on the validity of the two outlying observations at the
extreme right.

Figure 1. Histogram of summer chlorophyll a data from Table 1

13.  The picture created by the histogram is of considerable value
in the selection of descriptive statistics, as is noted in the next
section. Some care should be observed in the construction of the
histogram, however. With changes in interval size, the histogram may
assume different shapes which might affect the inferences drawn. For
example, in Figure 2a the chlorophyll a data are plotted using an
interval size of 20 units. With that scale, the two highest
observations are less likely to be considered outliers than they are on the
basis of the five-unit interval size histogram in Figure 2b. It is
probably good practice to scale the histogram so that the observations
are neither too bunched (as in Figure 2a, where 75 percent are
concentrated in two intervals) nor too spread out to permit reasonable
inferences to be drawn.
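
The effect of interval size can be checked numerically before plotting. The sketch below is an illustration under stated assumptions: the data are hypothetical (not Table 1), intervals are aligned to multiples of the width starting at zero, and the helper names `bin_counts` and `top_two_share` are introduced here.

```python
import math

def bin_counts(values, width):
    """Map each interval's lower limit to the number of observations in it."""
    counts = {}
    for v in values:
        lower = math.floor(v / width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return counts

def top_two_share(values, width):
    """Fraction of observations falling in the two fullest intervals."""
    fullest = sorted(bin_counts(values, width).values(), reverse=True)[:2]
    return sum(fullest) / len(values)

# Hypothetical chlorophyll a values (ug/L); NOT the Table 1 data.
chl = [17.7, 33.0, 41.2, 44.4, 48.9, 50.5, 52.1, 53.8, 55.6, 56.6, 57.9,
       58.3, 61.0, 62.2, 63.5, 64.2, 66.8, 69.9, 72.4, 131.0, 150.0]

for width in (5, 10, 20):
    share = top_two_share(chl, width)
    print(f"interval width {width:2d}: {share:.0%} of observations "
          f"in the two fullest intervals")
```

A share close to 100 percent suggests the observations are too bunched; a very small share suggests they are too spread out to show a shape.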

Figure 2. Histograms of summer chlorophyll a data

14.  As noted above, the histogram provides an impression of the
extent of symmetry in the sample. Symmetry in a data set is a
desirable attribute for two reasons. First, it often means that one can
characterize the sample as having a distribution with a shape similar to
those symmetric distributions (e.g., the normal and uniform
distributions) which are commonly an assumption of statistical analysis.
Stating, for example, that a sample approximates the normal distribution
conveys useful information to a reader. Beyond that, symmetry implies
that the common descriptive statistics such as the mean and standard
deviation can be used to provide an adequate summary of the sample (see
Part III).

15.  The foregoing discussion suggests that it might be useful to
apply a transformation (see Part III), if necessary, in order to create
symmetry in an asymmetric data set. Fortunately, limnological data are
often lognormally distributed, so the choice of transformation is often
straightforward. The lognormal distribution is strictly positive (all
observations > 0) and is skewed to the right. As an example, the
chlorophyll a histogram in Figure 1 approximately fits this description.
To check for lognormality, the logarithmic transformation is applied to
the data, and a histogram of the transformed data is plotted.
Comparison of this histogram with a normal distribution (i.e., a bell-shaped
curve) provides a rough test of lognormality; formal tests do exist
(e.g., Kolmogorov-Smirnov test or chi-square test) and may be found in
many statistics texts (e.g., Wonnacott and Wonnacott 1972, Benjamin and
Cornell 1970).

16.  To illustrate how a transformation may change the shape of a
histogram, the summer chlorophyll a data from Table 1 were
log-transformed, and the histogram of the logarithms of the chlorophyll a
observations was constructed in Figure 3. Compare Figure 1 with
Figure 3. Note how the logarithmic transformation achieved approximate
symmetry. Note also that the observations at the extreme right are less
likely to be considered outliers than they were in the original data.
In fact, the observation in Figure 3 at the extreme left is further from
the mean of the log-transformed sample (the logarithm of the geometric
mean) than is either of the points on the right. This is a result of the
effect of the logarithmic transformation: to "spread out" low values and
"squeeze in" high values. Through the study of histograms of this sample,
we are now in a position to determine descriptive statistics and to
summarize the data set.

Figure 3. Histogram of log-transformed chlorophyll a data
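
A rough numerical companion to the visual comparison is to compute a skewness coefficient before and after the transformation. The sketch below assumes hypothetical data (not Table 1) and uses the moment coefficient of skewness, a choice made for this illustration rather than a method prescribed in this report.

```python
import math
import statistics

def skewness(values):
    """Moment coefficient of skewness (zero for a perfectly symmetric sample)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical chlorophyll a values (ug/L); NOT the Table 1 data.
chl = [17.7, 33.0, 41.2, 44.4, 48.9, 50.5, 52.1, 53.8, 55.6, 56.6, 57.9,
       58.3, 61.0, 62.2, 63.5, 64.2, 66.8, 69.9, 72.4, 131.0, 150.0]
log_chl = [math.log10(v) for v in chl]

print(f"skewness of raw data:   {skewness(chl):+.2f}")
print(f"skewness of log10 data: {skewness(log_chl):+.2f}")

# Distance of the lowest and highest observations from the mean of the logs.
mean_log = statistics.fmean(log_chl)
low_distance = mean_log - min(log_chl)
high_distance = max(log_chl) - mean_log
print(f"lowest value is {low_distance:.2f} log units below the mean, "
      f"highest is {high_distance:.2f} above")
```

For this made-up sample the transformation reduces the magnitude of the skewness, and the lowest observation ends up further from the mean of the logarithms than the highest one, which mirrors the "spread out" and "squeeze in" behavior described in paragraph 16.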


Stem  and  Leaf  Displays 

17.  An  alternative  and  often  informative  version  of  the  histogram 
is  the  stem  and  leaf  display.  Developed  by  Tukey  (1977),  the  stem  and 
leaf  plot  provides  the  shape  of  a  histogram  while  at  the  same  time 
presenting  the  numerical  values  from  the  data  set.  As  an  example,  the 
stem  and  leaf  display  for  the  summer  chlorophyll  a  data  in  Table  1  is 
plotted  in  Figure  4;  note  that  the  shape  is  nearly  the  same  (round-off 
variations  create  the  slight  differences)  as  the  histogram  in  Figure  1. 

Figure 4. Stem and leaf display of the summer chlorophyll a data

18.  To construct the stem and leaf diagram, first choose the
interval width. The "stem" becomes the digit level corresponding to
this interval width (for Figure 4, the stem contains the "tens" digit
since the interval width is 10 units), and values for the stem are
placed to the left of a vertical line. On the right side of this line,
the "leaves" are written. For each data point, the leaf is the next
digit lower in value than the stem's digit. Since the stems in Figure 4
are composed of the tens digit, the leaves are made up of the units
digits. Each observation contributes one leaf to the row containing its
stem. For the summer chlorophyll a data in Table 1, the first
observation (52.1 µg/L) results in a 2 (the units digit) placed in the row for
the stem value 5 (the tens digit); the second observation (55.6 µg/L,
rounded to 56) results in a 6 placed in the row for stem value 5, and so
on.
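
The construction in paragraph 18 can be sketched as follows. The function name `stem_and_leaf` and the data in `chl` are assumptions for illustration (only the first two values, 52.1 and 55.6, come from the text); the sketch assumes positive values and an interval width of 10 units.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows with the tens digit(s) as stem and units digit as leaf."""
    rows = defaultdict(list)
    for v in values:
        rounded = int(v + 0.5)            # round to whole units (55.6 -> 56)
        stem, leaf = divmod(rounded, 10)  # e.g., 56 -> stem 5, leaf 6
        rows[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in sorted(rows.get(stem, [])))
        lines.append(f"{stem:2d} | {leaves}")
    return lines

# Hypothetical chlorophyll a values (ug/L); only 52.1 and 55.6 are from the text.
chl = [52.1, 55.6, 17.7, 33.0, 41.2, 44.4, 48.9, 50.5, 53.8, 56.6, 57.9,
       58.3, 61.0, 62.2, 63.5, 64.2, 66.8, 69.9, 72.4, 131.0, 150.0]

for line in stem_and_leaf(chl):
    print(line)
```

Because values are rounded before the stem is taken, an observation such as 69.9 lands in the row for stem 7 rather than 6, which is the kind of round-off difference from the histogram noted in paragraph 17.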

19.  The primary advantage of the stem and leaf display (over the
histogram) is that it contains information on the numerical values in
the data set (while retaining the ability to provide information on the
shape of the sample distribution). There may be advantages to this,
particularly when the data are displayed for presentation purposes.
Tukey (1977) describes several variations of the stem and leaf display,
including an interesting way to look at covariation in bivariate (e.g.,
chlorophyll and phosphorus) data.

Box Plots


20.  Often there is a need to compare two or more samples of the
same characteristic (e.g., samples for chlorophyll a from two different
sampling stations or reservoirs). This comparison may be purely
statistical, perhaps using one of the procedures presented in Part IV.
Alternatively, a graphical method could be used that provides both a
pictorial comparison and a statistical comparison. The graphical
display that permits this is the box plot.

21.  In Figure 5, two box plots are presented, one for the summer
chlorophyll a data and the other for the fall chlorophyll a data in
Table 1. (Assume the data were collected at two different times of the
year in the same reservoir.) Box plots are based on order statistics
(Table 2). Order statistics are observations, like the median, that are
used to summarize a sample because of their position in a ranking from
lowest value to highest value. Box plots convey information on the sample
median, dispersion, skew, relative size of the data set, and statistical
significance of the median. To construct a box plot for a sample on a
single variable, the steps below may be followed (from Reckhow and
Chapra 1983):

a.  Order  the  data  from  lowest  to  highest. 

b.  Plot  the  lowest  and  highest  values  on  the  graph  as  short 
horizontal  lines.  These  represent  the  extreme  values  for 
each  box  plot. 

c.  Determine the upper and lower quartiles for the data set
(see Part III). These values define the positions of the
upper and lower edges of the box. Using vertical lines,
connect the highest value with the upper quartile and the
lowest value with the lower quartile.

d.  Plot  the  median  as  a  dashed  horizontal  line  within  the 
box. 

e.  Select a scale so that the width of the box represents
the sample size (the size of the data set used to
construct each box). For example, each centimetre of width
could represent 25 observations.

f.  Determine the height of the notch (in the box at the
median) based on the statistical significance of the
median. Based on work by McGill, Tukey, and Larsen
(1978), the height of the notch above and below the median
is approximately:

$$\text{Notch limits} = \text{Median} \pm 1.7\left(\frac{1.25\,I}{1.35\sqrt{N}}\right)$$

where $I$ is the interquartile range (upper quartile minus
lower quartile) and $N$ is the sample size.

With this mathematical definition of the notch limits,
the notch in the box provides an approximate 95-percent
confidence interval for comparison of box medians.
Therefore, when the notches for any two boxes overlap in
a vertical sense, the medians are not significantly
different at about the 95-percent level.
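
The notch calculation in step f can be sketched as below, using the medians and quartiles listed in Table 2. The sample size is not stated in this section, so `n = 27` is an assumption, chosen only because it approximately reproduces the notch limits tabulated in Table 2; the function names are likewise introduced for this illustration.

```python
import math

def notch_half_width(lower_q, upper_q, n):
    """Half-height of the notch: 1.7 * (1.25 * IQR) / (1.35 * sqrt(n))."""
    iqr = upper_q - lower_q
    return 1.7 * (1.25 * iqr) / (1.35 * math.sqrt(n))

def notches_overlap(median_a, half_a, median_b, half_b):
    """True when the two notch intervals share any common value."""
    return (median_a - half_a <= median_b + half_b
            and median_b - half_b <= median_a + half_a)

n = 27  # ASSUMED sample size; not given in this section of the report

# Medians and quartiles from Table 2 (summer and fall chlorophyll a).
summer_median, fall_median = 56.6, 23.0
summer_half = notch_half_width(44.4, 63.5, n)
fall_half = notch_half_width(14.9, 30.5, n)

print(f"summer notch: {summer_median} +/- {summer_half:.1f}")
print(f"fall notch:   {fall_median} +/- {fall_half:.1f}")
print("notches overlap:",
      notches_overlap(summer_median, summer_half, fall_median, fall_half))
```

If the notches do not overlap, the medians can be regarded as significantly different at about the 95-percent level, as described in step f.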


Table 2

Order Statistics for Chlorophyll a Data Presented in Table 1

Order Statistics         Summer     Fall
--------------------     ------     ----
Minimum                    17.7      1.2
Maximum                   150.0     66.3
Median                     56.6     23.0
Quartiles
  Lower                    44.4     14.9
  Upper                    63.5     30.5
Interquartile range        19.1     15.6
Notch limits               ±5.8     ±4.7

22.  The first of the two box plots in Figure 5 is labeled so that
the characteristics mentioned above may be identified. The plot
provides information on both a single sample and a comparison among
samples. For a single sample we see:

a.  An estimate of the center of the sample (the median).

b.  A measure of dispersion for the sample (the interquartile
range).

c.  The range (highest value - lowest value) and an
impression of skew through a visual comparison of the symmetry
above and below the median.

23.  For  a  study  involving  two  or  more  samples  we  see: 

a.  A  statistical  test  of  significance  in  the  difference 
between  two  medians,  based  on  vertical  overlap  between 
notches. 

b.  A  visual  comparison  of  samples,  based  simply  on  observing 
the  similarities  and  differences  between  features  of  two 
box  plots. 

24.  Note that the notches for the two box plots in Figure 5 do
not overlap in a vertical sense, indicating that the median
chlorophyll a observation for the summer sampling date is significantly
different from the median chlorophyll a observation for the fall sampling
date. The skew in the summer sample is evident from the elongated upper
tail, and the summer sample as a whole is largely above the fall sample.

25.  Box plots are helpful both in diagnostic work, as above, and in
presenting conclusions about samples following the completion of a
statistical study. Reckhow (1979) describes several additions to the basic
box plot that might be useful in limnological analyses. Tukey (1977)
created the box plot and presents many interesting examples in his book
on exploratory methods in statistics.

Scatter  Plots 

26.  Many statistics (e.g., correlation coefficients) and many
statistical methods (e.g., regression analysis) are fundamentally
concerned with relationships between pairs of variables. Without doubt,
the best way to examine a relationship between pairs of variables, a
bivariate relationship, is through a scatter plot. The scatter plot is
simply a two-variable plot of observations on an x-y coordinate system.

27.  In  Figure  6,  a  bivariate  scatter  plot  is  presented  for  the 
data  on  summer  phosphorus  and  chlorophyll  a  in  Table  1.  From  the  plot, 
we  can  examine  the  distribution  of  data  for  each  of  the  variables 
separately  as  well  as  for  the  two  variables  together.  For  example,  we 
can  see  from  Figure  6  that  two  high  observations  for  chlorophyll  a  tend 
to  stand  apart  from  the  rest  of  the  data  (which  was  the  same  conclusion 
drawn  from  the  histogram  in  Figure  1).  Likewise,  one  observation  for 
phosphorus  tends  to  stand  apart  from  the  rest  of  the  data. 

28.  When  both  variables  in  Figure  6  are  examined  together,  we  see 
that  the  point  at  the  upper  right  of  the  plot  might  be  considered  an 
outlier.  With  this  point  removed,  the  linear  relationship  that  seems  to 
exist  in  Figure  6  is  much  less  apparent.  Therefore,  it  might  be  useful 
to  remove  this  point  from  the  sample,  replot  the  data,  and  effectively 
spread  out  the  remainder  of  the  observations  so  that  we  may  closely 
examine  the  pattern  in  the  cluster  of  points  at  the  lower  left. 

Figure 6. Bivariate scatter plot of the summer total phosphorus
and chlorophyll a data

29.  Basically two characteristics of a bivariate sample are of
interest in most statistical studies. First, the analyst often is
interested in the linearity or nonlinearity in the relationship. Linear
relationships are clearly desirable and are necessary for the correct
application of correlation analysis and ordinary least squares
regression. If the bivariate relationship is nonlinear, it is possible that a
transformation (see Part III) can be applied to make it linear. Without
question, the scatter plot is the most important diagnostic device for
evaluating linearity, and it is often quite helpful in selecting a
transformation.

30.  The second characteristic of a bivariate sample of particular
concern is the presence or absence of outliers. Outliers have no
universally accepted objective definition; rather the term is used here
to identify observations that stand apart from a cluster of points. We
are concerned about outliers because they exert more than their fair
share of influence on the value of statistics (such as correlation
coefficients and regression coefficients). Statistics and statistical
inferences are preferred when they are robust, or in other words, when
they change little if any particular observation is deleted from the
sample. Outliers can have a substantial influence on certain
statistics; therefore, it is good practice either to transform the data to
reduce the influence of the outlier or to carefully examine the outlier
to determine if it is a correctly measured, legitimate member of the
population sampled. A study of scatter plots is the best way to check
for the presence of outliers.
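
The influence of a single outlying point on a statistic can be illustrated by computing the correlation coefficient with and without it. The phosphorus and chlorophyll values below are hypothetical (they are not the Table 1 data), and the function name `pearson_r` is introduced here.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a loose cluster plus one point at the upper right.
total_p = [20, 25, 30, 35, 40, 45, 50, 55, 200]
chl_a = [50, 62, 45, 58, 52, 66, 48, 57, 150]

r_all = pearson_r(total_p, chl_a)
r_without = pearson_r(total_p[:-1], chl_a[:-1])  # drop the last (outlying) point

print(f"r with all points:        {r_all:.2f}")
print(f"r without outlying point: {r_without:.2f}")
```

A large change in the coefficient when one observation is deleted indicates that the statistic is not robust for that sample, which is the situation paragraph 30 warns about.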

31.  The  bivariate  scatter  plot  is  an  extremely  useful  diagnostic 
tool.  It  should  always  be  examined  near  the  beginning  of  any  work 
involving  the  study  of  covariation  in  pairs  of  variables.  Beyond  that, 
it  is  the  single  most  effective  way  to  convey  information  on  bivariate 
relationships  in  a  set  of  data.  Examples  illustrating  the  use  of 
scatter  plots  are  found  throughout  this  manual. 


PART  III:  BASIC  DESCRIPTIVE  STATISTICS 


Introduction 


32.  When a set of data is quite small, one may choose to present
the entire data set in a report. For large data sets, the scientist
recognizes that to most effectively transfer information he must
summarize the data set with a few well-chosen statistics. A choice is made
to trade some of the information contained in the entire data set for
the convenience of a few descriptive statistics. This choice is usually
a good one, provided the descriptive statistics that are selected
correctly represent the original data.

33.  Some descriptive statistics are so commonly used we forget
that they actually represent only one option among many candidate
statistics. For example, the mean and the standard deviation (or variance)
are statistics used to estimate the center of a data set and the spread
of those data. When these statistics are to be used, the scientist
should decide beforehand that they are the best choices to describe the
aforementioned characteristics of the data set. Often they are (notably
for symmetrically distributed data following an approximate normal
distribution), so their use is frequently justified. However, as we see
below, there are many situations with reservoir water quality data where
alternative descriptive statistics are preferred.

34.  In the selection of descriptive statistics, it is important
that the scientist have a clear understanding of the purpose that the
statistic serves. Descriptive statistics are selected because the
convenience of a few summary numbers outweighs the loss of information that
results when the entire data set is described by the statistics. It is
therefore essential that as much information as possible be summarized
in the descriptive statistics because the alternative may be a
misrepresentation of the original data.

35.  Certain specific features of the data set are characterized
by using descriptive statistics. For example, the center, or central
tendency of a set of data, is probably the most important measure.
Among the candidate statistics are the mean, median, mode, and geometric
mean. Once the center of a data set is described, the next important
feature requires a statistic estimating the spread, scale, or
dispersion. Among the candidates for this task are the range, standard
deviation, and interquartile range. These two characteristics of a data set,
central tendency and dispersion, are the most commonly described
features. Other characteristics, such as skewness and kurtosis, are
occasionally important as well. It is useful now to look at some examples
that illustrate the choice of descriptive statistics.

Measures  of  Central  Tendency 

36.  Probably  the  single  most  useful  statistic  summarizing  a  data 
set  is  an  indication  of  the  center  of  the  sample.  By  center  we  imply 
the  vague  notion  of  the  middle  of  the  cluster  of  points  or  perhaps  the 
region  of  greatest  concentration  of  points.  Since  samples  exhibit  a 
variety  of  distributions  when  plotted  as  histograms,  it  is  not  possible 
to  unambiguously  define  the  center,  and  as  a  result  there  are  several 
statistical  estimators  that  serve  as  candidates  for  determining  central 
tendency  or  location.  Each  candidate,  as  noted  below,  may  be  considered 
to  have  its  own  advantages  and  disadvantages  for  the  task  at  hand. 

Mean  (arithmetic) 

37.  The arithmetic mean, or simply, the mean, is the most
frequently used of the central tendency estimators. It is so commonly used
that the investigator often loses sight of the true reason for
calculating descriptive statistics. The result is that the mean is sometimes
calculated as the central tendency statistic in situations where another
estimator would be better.

38.  The arithmetic mean ($\bar{x}$) is the sum of the observations ($x_i$)
divided by the number of observations ($n$):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Each observation contributes its magnitude to the sum of the
observations and hence to the mean. For symmetric distributions (like the
normal or Gaussian distribution), this is desirable and leads to an
efficient (minimum variance) estimator. However, as noted in Part II,
limnological data are often not symmetrically distributed. Skewed data
"pull" the mean in the direction of the skew; this means that a few
extremely high observations can pull the mean away from the bulk of the
observations and toward the few high data points. In those situations,
a robust estimator, like the median or the mode, might be preferred.

Median

39.  When a set of data is ordered from lowest to highest value,
the median is identified as the middle value. The median is therefore
known as an "order statistic" since it is based on an ordering or
ranking of observations. When the total number of observations is an even
number, leading to two middle values, the median is then the average of
the two middle values.

40.  The "order" of the median observation is:

$$\text{Median observation} = \frac{n + 1}{2}$$

Since the effect on the median of all but the middle-ranking
observations is simply to hold a place in the ranking, outlying observations do
not pull the median toward the extremes. The median is robust to the
influence of any single observation, and thus it is a good statistic to
use when the histogram is skewed or unusually shaped.

Mode 

41.  The  mode  is  the  value  in  the  sample  that  is  most  frequently 
observed.  In  terms  of  a  histogram,  the  mode  is  represented  by  the  bar 
of  greatest  height.  The  mode  is  considered  a  good  estimator  for  central 
tendency  because  the  most  frequently  observed  value  is  usually  near  the 
middle of a distribution. An examination of a histogram of the sample
will indicate whether, in fact, the mode does correspond with the
center.


Geometric  mean 

42. The geometric mean is the antilog of the mean of logarithmically
transformed observations. It is, therefore, a reasonable measure
of central tendency for a set of data that exhibit a lognormal
distribution. It may also be determined by calculating the nth root
of a product of n observations:

$$\text{Geometric mean} = \left(\prod x\right)^{1/n}$$

where $\prod x = x_1 \cdot x_2 \cdot x_3 \cdots x_n$, or equivalently

$$\text{Geometric mean} = \operatorname{antilog}\left(\frac{\sum \log(x_i)}{n}\right)$$

43. The data presented in Table 1 for chlorophyll a, and reproduced
in a histogram in Figure 1, are used to calculate statistics for
central tendency; these values are listed in Table 3. Note that the
mode is given as the range of values corresponding to the highest bar on
the histogram; it is not meaningful to identify a particular chlorophyll
a value (in units of 0.1 µg/L) as the mode in this example because
few values are duplicated. Some skewness is apparent in the histogram
in Figure 1, and these data appear to have an approximate lognormal
distribution. With skewed data, the mean is "pulled" to the right relative
to the median and the geometric mean. Thus, the mean in Table 3 is
higher than the median and the geometric mean.
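
The central tendency calculations described above can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: the values are the ordered chlorophyll a observations of
Table 5, and the 27th value, which is not legible in the table as
reproduced here, is assumed to be 150.4 (the minimum of 17.7 plus the
range of 132.7 reported in Table 4). Variable names are illustrative.

```python
import math
import statistics

# Summer chlorophyll a values (ug/L), ordered as in Table 5.
# Assumption: the 27th value (150.4) is inferred from the minimum (17.7)
# plus the range (132.7) given in Table 4.
chl = [17.7, 25.5, 30.0, 37.7, 39.5, 42.5, 42.6, 46.1, 52.1,
       52.2, 53.6, 53.6, 55.6, 56.6, 57.6, 59.4, 60.3, 61.2,
       61.9, 63.1, 63.8, 67.4, 74.8, 76.0, 79.5, 133.4, 150.4]

n = len(chl)
mean = sum(chl) / n
median = statistics.median(chl)

# Geometric mean: antilog of the mean of the log-transformed values.
geo_mean = math.exp(sum(math.log(x) for x in chl) / n)

# Modal class: the 10-unit histogram class holding the most observations.
counts = {}
for x in chl:
    lower = int(x // 10) * 10
    counts[lower] = counts.get(lower, 0) + 1
modal_lower = max(counts, key=counts.get)

print(f"mean           = {mean:.1f}")
print(f"median         = {median:.1f}")
print(f"geometric mean = {geo_mean:.1f}")
print(f"modal class    = {modal_lower}-{modal_lower + 10}")
```

With these inputs the ordering geometric mean < median < mean emerges,
which is the pattern expected for data skewed to the right.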


Measures  of  Dispersion 


44. Other than central tendency, measures of dispersion or spread
are the most commonly cited statistics used to summarize a data set.
Dispersion in a data set refers to the variability in the observations
about the center of the distribution. Good measures of dispersion will
be obtained from symmetric distributions. Asymmetry, or skewness, will
affect the estimate of dispersion so that it overestimates spread in the
shorter tail of the data distribution (while underestimating the spread
in the longer tail). A transformation may be used to create a symmetric
distribution.


Table 3

Measures of Central Tendency for the Summer
Chlorophyll a Data in Table 1

Measure              Value
Mean                 59.8
Median               56.6
Mode                 50-60
Geometric mean       54.6

Standard  deviation 

45. The most commonly used statistic for dispersion is the
standard deviation. Like the mean, the standard deviation has been used
so often that it sometimes is thought to be equivalent in definition to
dispersion. In fact, like the mean, the standard deviation is strongly
affected by extreme values. Thus, the standard deviation for a distribution
of data with a long tail to the right (e.g., the histogram in
Figure 1) is inflated by the values at the extreme right. It may be
preferable to apply a transformation to create a symmetric distribution
before calculating the standard deviation.

46. For a sample, the sample variance ($s^2$) is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

and the sample standard deviation ($s$) is the square root of the
variance:

$$s = \sqrt{s^2}$$

Interquartile  range 

47. Since the standard deviation is unduly influenced by extreme
observations in asymmetric distributions of data, we would like a robust
alternative to the standard deviation (as the median is to the mean)
for situations in which the data are skewed but a transformation is
undesirable. Fortunately a good alternative exists: the interquartile
range. Like the median, the interquartile range is based on order
statistics, and thus is unaffected by the magnitude of the extreme
observations in either tail. It is calculated as the difference between the
observation at the 75-percent level (upper quartile) and the observation
at the 25-percent level (lower quartile):

Lower quartile rank order = 1/2 (1 + median rank order)

Upper quartile rank order = 1 + n - lower quartile rank order

Interquartile range = upper quartile value - lower quartile value


The  interquartile  range  is  used  as  the  measure  of  dispersion  in  the  box 
plot  presented  in  Part  II. 

Range 

48.  An  easily  determined  and  therefore  frequently  cited  measure 
of  dispersion  is  the  range.  The  range  is  simply  the  maximum  value  minus 
the  minimum  value.  Since  it  is  clearly  affected  by  the  magnitude  of  the 
observations  at  either  extreme,  the  range  should  not  be  relied  upon  as 
the  sole  indicator  of  variability.  Nonetheless,  it  is  often  informative 
to  list  the  range  along  with  one  of  the  other  two  dispersion  statistics 
mentioned  above. 

49. In Table 4, measures of dispersion have been calculated for
the summer chlorophyll a data presented in Table 1. The range, of
course, is largest in magnitude. The skewness in the data results in a
standard deviation that is next largest in magnitude of those statistics
presented. If the two largest chlorophyll a observations are removed
from the data set, the standard deviation drops from 27.9 µg/L to
15.3 µg/L. This is a substantial effect due to only 2 of 27 observations,
and it underscores the impact that extreme observations have on
the standard deviation.

Table 4

Measures of Dispersion for the Summer
Chlorophyll a Data in Table 1

Measure                     Value
Standard deviation          27.9
Interquartile range         19.1
Range                       132.7
Antilog SD (log CHLa)*      24.5

* 1/2 [antilog (mean + std dev) - antilog (mean - std dev)] for
  log-transformed chlorophyll a data.
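
The dispersion statistics discussed in this section can be sketched as
follows. The example is illustrative only and rests on assumptions: the
27th chlorophyll a value is taken as 150.4 (minimum plus the reported
range), quartile values at fractional rank orders are obtained by linear
interpolation between the two neighboring ordered observations, and
natural logarithms are used for the antilog SD. The helper
`value_at_rank` is a hypothetical name introduced for this sketch.

```python
import math
import statistics

# Ordered summer chlorophyll a values (ug/L).
# Assumption: the 27th value (150.4) = minimum (17.7) + range (132.7).
chl = [17.7, 25.5, 30.0, 37.7, 39.5, 42.5, 42.6, 46.1, 52.1,
       52.2, 53.6, 53.6, 55.6, 56.6, 57.6, 59.4, 60.3, 61.2,
       61.9, 63.1, 63.8, 67.4, 74.8, 76.0, 79.5, 133.4, 150.4]


def value_at_rank(sorted_data, rank):
    """Value at a 1-based rank; a fractional rank is linearly interpolated."""
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_data[lo - 1] + frac * (sorted_data[hi - 1] - sorted_data[lo - 1])


data = sorted(chl)
n = len(data)

sd = statistics.stdev(data)                  # divisor n - 1

median_rank = (n + 1) / 2                    # 14 for n = 27
lower_rank = (1 + median_rank) / 2           # 7.5
upper_rank = 1 + n - lower_rank              # 20.5
iqr = value_at_rank(data, upper_rank) - value_at_rank(data, lower_rank)

data_range = data[-1] - data[0]

# Antilog SD: half the width of the back-transformed (mean +/- SD) interval.
logs = [math.log(x) for x in data]
m_log = statistics.mean(logs)
s_log = statistics.stdev(logs)
antilog_sd = 0.5 * (math.exp(m_log + s_log) - math.exp(m_log - s_log))

# Standard deviation after dropping the two largest observations.
sd_trimmed = statistics.stdev(data[:-2])

print(f"standard deviation     = {sd:.1f}")
print(f"interquartile range    = {iqr:.1f}")
print(f"range                  = {data_range:.1f}")
print(f"antilog SD             = {antilog_sd:.1f}")
print(f"SD without two largest = {sd_trimmed:.1f}")
```

Under these assumptions the results agree closely with Table 4, and the
drop in the standard deviation when the two largest values are removed
shows how sensitive that statistic is to the upper tail.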

50. Since two observations greatly affect the value of the standard
deviation for the data in Table 1, and if there is no basis for
removing these observations from the data set, it may be wise to
state that the data are skewed right and use one of the other measures
of dispersion. In Table 4, both the interquartile range and the antilog
SD lie between the two standard deviations previously cited (n = 27 and
n = 25), and thus may represent good compromise choices. The antilog SD
is a reasonable expression of the standard deviation (in antilog units)
for a data set that has a lognormal distribution. Given the familiarity
of applied scientists with various measures of dispersion, a good rule
of thumb might be to cite the standard deviation and the range for
symmetric data sets, and the interquartile range and the range for
asymmetric data sets.

Ranks 

51. On occasion it is preferable to examine and test a data set
on the basis of the rank order of observations. In those situations,
the observations in a data set are simply ordered from low value to high
value according to one particular variable. As an example, the data set
presented in Table 1 has been ordered according to the magnitude of the
chlorophyll a observations and is presented in Table 5.


Table 5

Summer Chlorophyll a Data of Table 1
Ordered by Magnitude

Order   CHLa (µg/L)     Order   CHLa (µg/L)
  1        17.70          15       57.60
  2        25.50          16       59.40
  3        30.00          17       60.30
  4        37.70          18       61.20
  5        39.50          19       61.90
  6        42.50          20       63.10
  7        42.60          21       63.80
  8        46.10          22       67.40
  9        52.10          23       74.80
 10        52.20          24       76.00
 11        53.60          25       79.50
 12        53.60          26      133.40
 13        55.60          27      150.40
 14        56.60

52. Ranks or ordered data are useful in nonparametric analyses
(see section, Nonparametric Analyses) and in exploratory data analysis
(see Part II). In particular, when the assumption of normality is not
reasonable, or when the underlying probability distribution (generating
a set of data) is unknown, rank-based statistics and statistical tests
are often appropriate.

Frequencies 


53.  Once  a  data  set  is  rank  ordered,  it  is  often  useful  to  group 
the  data  and  present  the  frequency  of  observations  found  within  a 
subsection  of  the  entire  range.  This  is  done  graphically  in  Part  II 
(Data  Displays) ;  both  the  stem  and  leaf  diagram  and  the  histogram  are 
graphical  displays  of  the  frequency  of  an  observation  for  equally  spaced 
intervals  of  the  range.  For  example,  the  bars  in  the  histogram  in 
Figure  1  have  a  relative  height  proportional  to  the  relative  frequency 
of  observations  within  each  class. 

54.  In  Table  6,  the  absolute  and  relative  frequencies  are 
presented  for  each  of  the  cells  (or  classes)  in  Figure  1.  The  absolute 
frequency  is  expressed  in  terms  of  number  of  observations  within  each 
class,  and  the  relative  frequency  is  expressed  as  the  percent  (of  the 
total  number  of  observations)  contained  within  each  class.  Cumulative 
frequencies  are  also  presented  in  Table  6;  these  indicate  the  relative 
or  absolute  number  of  observations  less  than  or  equal  to  a  particular 
class level. Thus, in Table 6, it is indicated that 92.59 percent of
the chlorophyll a observations are less than 80 µg/L.
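
A frequency table of this kind can be built directly from the ordered
data. The sketch below is illustrative and assumes the same chlorophyll a
values used for Table 5, with the 27th value taken as 150.4 (minimum plus
reported range), and classes of width 10 that include their lower limit
and exclude their upper limit.

```python
# Ordered summer chlorophyll a values (ug/L).
# Assumption: the 27th value (150.4) = minimum (17.7) + range (132.7).
chl = [17.7, 25.5, 30.0, 37.7, 39.5, 42.5, 42.6, 46.1, 52.1,
       52.2, 53.6, 53.6, 55.6, 56.6, 57.6, 59.4, 60.3, 61.2,
       61.9, 63.1, 63.8, 67.4, 74.8, 76.0, 79.5, 133.4, 150.4]

n = len(chl)
width = 10

rows = []
cumulative = 0
for lower in range(10, 160, width):
    upper = lower + width
    # Absolute frequency: observations with lower <= x < upper.
    count = sum(1 for x in chl if lower <= x < upper)
    cumulative += count
    rows.append((lower, upper, count, 100 * count / n,
                 cumulative, 100 * cumulative / n))

print(f"{'Class':>10} {'Freq':>5} {'Pct':>7} {'CumFreq':>8} {'CumPct':>7}")
for lower, upper, count, pct, cum, cum_pct in rows:
    print(f"{lower:>4} < {upper:<3} {count:>5} {pct:>7.2f} {cum:>8} {cum_pct:>7.2f}")
```

The cumulative percent in the row for the 70 < 80 class is the figure
quoted in the text for observations below 80 µg/L.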

Transformations 


55. It is often necessary to apply a transformation to reservoir
water quality data in order to meet the assumptions of the statistical
procedures being used. For example, methods of estimation (e.g.,
regression analysis) and hypothesis testing are based on certain
assumptions about the observations. In some cases, it is permissible to
violate the assumptions without greatly affecting the analysis;
alternatively, it is sometimes possible to apply a method (e.g.,
distribution-free procedures) with less restrictive assumptions. However,
there are still likely to be situations in which the assumptions must be
approximately met and in which the best approach is to apply a
transformation to the data.

Table 6

Absolute and Relative Frequencies for the Summer
Chlorophyll a Data in Table 1

                                          Cumulative
Class Limits    Frequency   Percent   Frequency   Percent
  10 <  20          1         3.70        1         3.70
  20 <  30          1         3.70        2         7.41
  30 <  40          3        11.11        5        18.52
  40 <  50          3        11.11        8        29.63
  50 <  60          8        29.63       16        59.26
  60 <  70          6        22.22       22        81.48
  70 <  80          3        11.11       25        92.59
  80 <  90          0         0.00       25        92.59
  90 < 100          0         0.00       25        92.59
 100 < 110          0         0.00       25        92.59
 110 < 120          0         0.00       25        92.59
 120 < 130          0         0.00       25        92.59
 130 < 140          1         3.70       26        96.30
 140 < 150          0         0.00       26        96.30
 150 < 160          1         3.70       27       100.00
 Total             27       100.00

56. Transformations are most commonly used in data analysis to
reexpress a data set so that it is more consistent with the important
assumptions (e.g., normality) of a statistical analysis, and/or to
diminish the impact of outlying observations (see section Scatter Plots,
Part II). To achieve these objectives, transformations may:

a.  Straighten  (linearize)  a  nonlinear  relationship  between 
two  variables, 

b.  Reduce  skew  (achieve  symmetry)  in  a  data  set  for  a  single 
variable,  and/or 

c. Stabilize variance (create constant variance) for a
particular variable across two or more data sets.

57. Selection of a transformation for these three functions is
beyond the scope of this manual, but fortunately it is quite clearly and
simply presented by Velleman and Hoaglin (1981) using transformations of
selected order statistics. Reservoir water quality data are often
skewed right, exhibiting an approximate lognormal distribution. When
this occurs, the logarithmic transformation is generally appropriate if
the objective is to obtain a symmetric, approximately normal distribution
of data. Reckhow and Chapra (1983, Chap. 6) show how the application
of the log transformation to a data set simultaneously addressed
problems of nonlinearity and skewness. Achievement of more than one
objective with a transformation is actually not uncommon; the investigator
should therefore be encouraged to consider data transformations
whenever it appears that they may improve an analysis.

58. To illustrate the effect of a transformation (the discussion
of data displays in Part II also contains an examination of
transformations), the chlorophyll a and total phosphorus (TP) data
from Table 1 were plotted, transformed, and then plotted again. The
logarithmic  transformation  was  applied,  since  it  is  likely  to  be  the 
most  frequently  used  transformation  in  limnological  studies.  Figure  7 
presents  the  untransformed  bivariate  plots  of  chlorophyll  a  versus  total 
phosphorus  for  the  two  samples;  Figure  8  is  a  log-transformed  plot  of 
the  data  from  Table  1. 

59. The effect of a log transformation is to "stretch out" data
on the left side of a plot and to "pull in" data on the right side of a
plot. This is the reason that the log transform tends to create a
symmetric distribution from data that are skewed right. Note that the
stretching-out and pulling-in effects are observed when comparing
Figure 8a with Figure 7a in either a horizontal or vertical direction.
The effect of the transformation in Figure 8a is desirable since the
"bunched" data near the origin in Figure 7a are spread out in Figure 8a.
Correspondingly, the two highest chlorophyll a observations are closer
to the center of the data in Figure 8a. As a result of these effects,
the investigator is likely to obtain more representative statistical
summaries of the data under the logarithmic transformation.

60. In contrast, the stretching and pulling effects that favored
the log transformation above have an undesirable effect on the data in
Figures 7b and 8b. Since the untransformed data in Figure 7b are not
bunched near the origin (nor are they particularly nonlinear), the log
transform spreads the data near the origin and bunches them at the upper
right in Figure 8b. In this case, the untransformed data should lead to
better statistical summaries. This latter situation is somewhat unusual
for limnological data; nonetheless, the investigator should always
examine plots of the data to check on the effectiveness of a
transformation.
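
Besides examining plots, one simple numerical check on a transformation
is to compare a skewness coefficient before and after it is applied. The
sketch below is illustrative only. It uses the moment coefficient of
skewness (third central moment divided by the 1.5 power of the second),
which is one common choice and is not prescribed by this manual, and it
assumes the Table 5 chlorophyll a values with the 27th value taken as
150.4 (minimum plus reported range).

```python
import math

# Ordered summer chlorophyll a values (ug/L).
# Assumption: the 27th value (150.4) = minimum (17.7) + range (132.7).
chl = [17.7, 25.5, 30.0, 37.7, 39.5, 42.5, 42.6, 46.1, 52.1,
       52.2, 53.6, 53.6, 55.6, 56.6, 57.6, 59.4, 60.3, 61.2,
       61.9, 63.1, 63.8, 67.4, 74.8, 76.0, 79.5, 133.4, 150.4]


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (zero for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw_skew = skewness(chl)
log_skew = skewness([math.log10(x) for x in chl])

print(f"skewness, untransformed   = {raw_skew:.2f}")
print(f"skewness, log-transformed = {log_skew:.2f}")
```

A coefficient near zero after transformation suggests that approximate
symmetry has been achieved; a coefficient that moves farther from zero
suggests, as in the case of Figures 7b and 8b, that the untransformed
data may be the better choice.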


PART  IV:  ANALYSIS  OF  WATER  QUALITY  DATA 


Introduction 


61. Data analysis will generally fit into one of two categories:
estimation of parameters or tests of hypotheses. Since hypothesis
testing is basic to much statistical analysis, at least an elementary
understanding of the concept is necessary.

62. Commonly one wishes to test hypotheses about the parameters
that have been estimated. In particular, in considering the relationship
between one variable and another, one will often wish to test
hypotheses concerning the slope and/or intercept of the regression.
Also, experimentation is frequently directed toward the testing of
hypotheses rather than parameter estimation. Both topics, regression
and experimental design, will be discussed more fully in subsequent
sections.


Population  parameters 

63. One must first have an understanding of what is meant by the
term population parameter. The most common parameters are the mean,
represented by $\mu$, and the variance, represented by $\sigma^2$. The
mean, or arithmetic average, is the most familiar and commonly estimated
parameter. It is calculated as the sum of all the individual observations
($Y_i$) in the population divided by the total number ($N$) of
observations in the population:

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N}$$

where

N = the size of the population
Y = an observation from the population
i = a subscript value from 1 to N which identifies the
    individual observation being summed

and $\Sigma$ indicates that the observations are to be summed from
observation i = 1 to observation i = N.

64. Variance is given by the formula

$$\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$$

where previous definitions apply.

65. Variance is obtained from the sum (across all observations in
the population from i = 1 to N) of the squared deviations of the
individual observations ($Y_i$) from the population mean ($\mu$). This
sum of squared deviations is then divided by N, giving an "average squared
deviation." If individual observations deviate greatly from the
mean, the variance will be high. If the observations deviate only
slightly from the mean, the variance will be low.

66. It will be noted that, in determining the population variance,
the deviations were squared. Clearly some of the individual
values will fall below the mean, giving a negative deviation, and the
remainder will fall above the mean, giving a positive deviation. When
their signs are retained, the deviations will thus always sum to zero.
The sum of the actual deviations is, therefore, not useful as a measure of
variability, and some method of considering only the size of the
deviations, without regard to their sign, is needed. There are two obvious
alternatives, the squares of the deviations and the absolute values,
neither of which can be negative. The square of the value has some
theoretical advantages over the absolute value, so statistical
calculations of variability usually employ the variance as defined above.

67. The variance is the average squared deviation of the individual
observations from the mean of the population. It is reasonable to
convert this value to the same scale as the observations by taking the
square root. This is, indeed, what is usually done. The resulting
value is called the standard deviation.

68. The calculation described earlier for the variance was the
sum of the N squared deviations in the whole population, which was
then divided by N. When dealing with a sample from a population, one
usually defines the sample variance as the sum of the squared deviations
from the sample mean divided by one less than the sample size. Thus, if
the sample size is n, the sum of the squared deviations is divided by
n-1 (called the degrees of freedom).

69. The population deviation was calculated as $Y_i - \mu$, where $\mu$
is known for the population and does not have to be estimated. For a
sample, the deviation is calculated as $Y_i - \bar{Y}$, which uses the
sample mean $\bar{Y}$. Since $\bar{Y}$ is calculated from the same sample
used to calculate the sample variance ($s^2$), one degree of freedom is
lost, and the denominator is reduced by one. The resulting calculation is

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

which gives an unbiased estimate of $\sigma^2$. Using n-1 rather than n
corrects for the bias introduced by using $\bar{Y}$ to estimate $\mu$. The
use of n-1 instead of n is particularly important when n is small.
Although it becomes less important as n gets larger, it appears to be
good policy always to use n-1 as the divisor.
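
The effect of the divisor can be checked exactly on a small example. The
sketch below is illustrative only: the four-value population is
hypothetical, and every possible ordered sample of size 2 drawn with
replacement is enumerated so that the long-run average of each variance
estimate can be computed without simulation.

```python
from itertools import product

# Hypothetical population, chosen only so the arithmetic is easy to follow.
population = [2.0, 4.0, 4.0, 10.0]
N = len(population)
mu = sum(population) / N
sigma2 = sum((y - mu) ** 2 for y in population) / N   # population variance

n = 2
# Every ordered sample of size n drawn with replacement, equally likely.
samples = list(product(population, repeat=n))


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    y_bar = sum(sample) / len(sample)
    return sum((y - y_bar) ** 2 for y in sample)


# Average of the variance estimates over all possible samples.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(f"population variance      = {sigma2}")
print(f"average using divisor n   = {avg_div_n}")
print(f"average using divisor n-1 = {avg_div_n_minus_1}")
```

In this enumeration the n-1 divisor averages to the population variance,
while the n divisor averages to only half of it, which is the bias the
text describes for small n.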

Hypothesis  testing 

70. Repeated mention will be made of assumptions for the various
tests and analyses discussed. All analyses assume that the data used
are drawn at random from the target population, i.e., the population
about which inferences are to be made.

71. We also need to consider the distribution of the sample mean,
which exists conceptually rather than in reality. Specifically, a random
sample from a population provides a single sample mean. A second
independent sample from the same population provides a second sample
mean. We could in theory, if not in practice, take an indefinitely
large number of random samples of the population and thus obtain an
indefinitely large number of sample means. These sample means then have
their own frequency distribution (the distribution of the sample means)
which can be determined from knowledge of the population distribution
and the sample size. The sample variance, likewise, has a distribution,
as does any statistic or function of the sample observations.

72. A further assumption common to the early discussion of hypothesis
testing is that the data are normally distributed. Tests of
hypothesis have been developed based on the normal distribution because
it commonly arises, if not exactly, as an extremely good approximation
of the population distribution. Indeed, even if the distribution from
which the samples are drawn is not normal, the distribution of the means
of various samples will approximate the normal distribution for a
sufficiently large sample size. The larger the sample size, the more
nearly normal the distribution of the means. Although the rate at which
the distribution of a sample mean approaches normality depends on the
nature of the population distribution, it is quite rapid for most
practical situations, and the normality assumption can be quite viable
for fairly small samples.
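
The approach of the sample mean toward a symmetric distribution can be
illustrated without simulation by enumerating every possible sample from
a small, strongly skewed population. The sketch below is illustrative
only: the five-value population is hypothetical, samples are drawn with
replacement, and the moment coefficient of skewness is used as a simple
index of asymmetry (it is zero for a normal distribution).

```python
from itertools import product

# Hypothetical right-skewed population.
population = [1, 1, 1, 2, 10]


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sample_mean_skewness(n):
    """Skewness of the means of all equally likely ordered samples of size n."""
    means = [sum(s) / n for s in product(population, repeat=n)]
    return skewness(means)


skews = {n: sample_mean_skewness(n) for n in (1, 2, 4)}
for n, g in skews.items():
    print(f"n = {n}: skewness of the sample mean = {g:.3f}")
```

For independent observations the skewness of the sample mean shrinks in
proportion to the square root of the sample size, so quadrupling n halves
it; other departures from normality fade in a similar way.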

73. Since the normal distribution plays a central role, it is
pertinent that it be examined more closely. A normal distribution can
be represented by a bell-shaped curve, three examples of which are given
in Figure 9. Each of the curves in this figure has the same mean
($\mu = 20$), but different variances ($\sigma^2$ = 1, 2, and 3). An
objective of a test of hypothesis might be to detect values that are
unusually large and therefore probably do not belong to a population with
the hypothesized distribution. The taller, narrow curve in Figure 9 has
the smallest variance, and the density, or occurrence of observations, is
almost zero below 17 or above 23. Therefore, if a value of 15 were
observed, it is highly unlikely that the value belongs in the narrowest
distribution. However, it may well belong to one of the wider
distributions.

Figure 9. Three normal distributions with a common mean
($\mu = 20$) but different variances

74. Unfortunately, there is an infinite number of possible normal
curves with different combinations of mean and variance. A method of
standardizing the distribution is therefore necessary. The standardized
normal distribution is a bell-shaped curve with a mean of zero and a
variance of one as shown in Figure 10.

75. Examination of this curve will show that most of the distribution
(indeed about 95 percent) is contained in the interval from -2 to
+2, although the ends taper off to infinity. Standardization is
accomplished by applying the formula

$$\text{Standard normal deviate} = z = \frac{Y - \mu}{\sigma}$$

Figure 10. The standardized normal distribution

76. If, for example, a normal distribution is given with a mean
of 20 and a variance of 25, then would an observation with a value of 35
be a likely value to draw from the distribution? Standardization
produces a value of (35 - 20)/5 = 3. Ninety-five percent of the standard
normal curve falls between -1.96 and +1.96, and 99 percent between
-2.576 and +2.576. With an observed value of 35, either (a) the assumed
normal distribution ($\mu = 20$, $\sigma^2 = 25$) is applicable and a
very rare event has occurred, or (b) the distribution from which the
observation was drawn is other than that assumed. In this example the
event would appear to be so rare that one would be prepared to believe (b)
rather than (a). It is this type of reasoning that lies behind all
statistical hypothesis testing.
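
The standardization in this example, and the rarity of the result, can
be sketched as follows. The function names are illustrative; the tail
probability is obtained from the complementary error function in the
Python standard library rather than from a printed table.

```python
import math


def standard_normal_deviate(y, mu, sigma):
    """z = (Y - mu) / sigma."""
    return (y - mu) / sigma


def two_sided_tail_probability(z):
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2))


mu = 20.0
variance = 25.0
z = standard_normal_deviate(35.0, mu, math.sqrt(variance))
p = two_sided_tail_probability(z)

print(f"z = {z:.2f}")
print(f"probability of a deviate at least this extreme = {p:.4f}")
```

A deviate of 3 lies beyond the 99 percent limits of -2.576 and +2.576
quoted above, which is why alternative (b) is the more credible reading.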


One-Sample  Hypotheses 

77. The simplest type of test involves the comparison of an
observed parameter estimate against some hypothesized value. As an
example, suppose the objective is to determine if a reservoir chlorophyll
concentration exceeds some subjective estimate of trophic state.
Chlorophyll concentrations in excess of 10 mg/L are generally considered
to be indicative of eutrophy. To test reservoir chlorophyll
concentrations against this level, the investigator might obtain samples at
randomly selected times during the growing season.

78. The hypothesis to be tested should be stated formally in
advance of the study. Hypotheses may be "one-sided" or "two-sided." In
the case of the chlorophyll concentrations presented above, the two-sided
test would have the null hypothesis stated as "the chlorophyll
concentration is 10 mg/L," and the alternate hypothesis would state that
"the chlorophyll concentration is not equal to 10 mg/L."

79. These hypotheses can be stated mathematically as

NULL HYPOTHESIS       $H_0: \mu = 10$

ALTERNATE HYPOTHESIS  $H_A: \mu \neq 10$

80. However, these hypotheses are of little interest and are not
appropriate for use in this situation. What is of interest is the
magnitude of the chlorophyll concentration with respect to the value of
10 mg/L. Therefore, the appropriate hypotheses are

NULL HYPOTHESIS       $H_0: \mu \geq 10$

ALTERNATE HYPOTHESIS  $H_A: \mu < 10$

81. This null hypothesis states that the parameter ($\mu$) is greater
than or equal to 10 mg/L, which is indicative of eutrophy. The alternative
hypothesis states that the parameter is less than 10 mg/L, or
chlorophyll concentrations are below the eutrophic level. These, then,
are the correct hypotheses for the stated objective. The investigator
is now faced with the decisionmaking process, the testing of the
hypotheses. This will first require an estimate of $\mu$, the target
population mean to be tested. Suppose that a sample of 10 observations has
been taken during the growing season and that the mean is 11.09 mg/L.
Can the reservoir be considered eutrophic? This value is above the
boundary of 10 mg/L, but first examine the raw data values below.

RAW DATA VALUES: 5.2, 6.3, 4.1, 13.2, 35.7,
                 3.5, 3.4, 6.0, 8.8, 24.7

82. Only three of the values exceeded the boundary condition of
10 mg/L; the remaining values were considerably lower. However, we are
not interested in individual observations; we wish to test the estimate
of the population parameter ($\mu$) against 10 mg/L. Even with most of the
individual values below 10 mg/L, it is possible that the actual
population value is greater than 10 mg/L.

83. An appropriate method for the evaluation of the observed mean
is by way of the Student t-statistic, the value of which is given by

$$t = \frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}$$

where

$\bar{Y} = \hat{\mu}$, the estimated (indicated by the "hat," ^) value of
the parameter (i.e., the sample mean)

$\mu_0$ = hypothesized value (10 mg/L in the example)

$s_{\bar{Y}}$ = square root of the estimated variance of the mean
($s_{\bar{Y}}^2$) (see below for the calculation of $s_{\bar{Y}}^2$)

Note: the sample mean follows a distribution which has its own mean and
variance; $s_{\bar{Y}}^2$ is a sample-based estimate of the latter.
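
As an illustration, the t statistic for the ten raw data values listed
above can be computed as sketched below. The sketch assumes the usual
estimate of the variance of the mean, namely the sample variance (n-1
divisor) divided by the sample size; variable names are illustrative.

```python
import math
import statistics

# Raw chlorophyll values from the example (mg/L, as given in the text).
chl = [5.2, 6.3, 4.1, 13.2, 35.7, 3.5, 3.4, 6.0, 8.8, 24.7]
mu_0 = 10.0                        # hypothesized value

n = len(chl)
y_bar = statistics.mean(chl)       # sample mean, the estimate of mu
s = statistics.stdev(chl)          # sample standard deviation (n - 1 divisor)

# Assumption: estimated variance of the mean = s**2 / n,
# so its square root is s / sqrt(n).
s_ybar = s / math.sqrt(n)

t = (y_bar - mu_0) / s_ybar
df = n - 1

print(f"sample mean       = {y_bar:.2f}")
print(f"std error of mean = {s_ybar:.3f}")
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Although the sample mean lies above 10, the two large observations make
the standard error of the mean large, so the resulting t value is small.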

84. The statistic provides a measure of the size of the difference
between the measured and hypothesized value relative to the variability
of the mean. If the t value is large enough, the null
hypothesis may be rejected; in this case, the difference is said to be
"significant."

85. The problem with simply looking at the mean is that it is
impossible to know the characteristics of the distribution unless previous
work has been done on the parent population. How large a difference
is large enough to indicate that the difference is significant?
The advantage with the t statistic is that the characteristics of its
distribution are known, provided that the population distribution is
normal. Before continuing, the t distribution should be examined.

86. The t distribution is described by a bell-shaped curve, similar
to the normal curve discussed earlier. The curve is symmetrical,
highest in the center, and tapers on either end. The curve is centered
on zero, and about 0.66 of the total area under the curve (66 percent)
is contained in an interval one standard deviation unit above and below
the center (the standard deviation is a measure of the variability). An
interval from two standard deviation units below the center to two
standard deviation units above the center again contains about 0.95 (or
95 percent) of the total area. The difference between the t distribution
and the standard normal distribution is that there is but one
standard normal distribution. There are many t distributions, one for
each possible sample size. For very large samples, the t and standard
normal distributions are virtually identical, but the t distribution is
wider for smaller samples.

87. Just as important as the area in the center of the distribution is the area in the tail (or tails) of the distribution. Values that fall more than two standard deviation units from the mean occur only about 5 percent of the time, so they are relatively rare events, not expected to occur very often by random chance. The t table, Table 7, gives the values of t that will be exceeded with specified probability p.

88. If a t value were calculated for a sample of size 10, the sample would have 9 degrees of freedom (d.f.), and the appropriate tabular values would be obtained from the line corresponding to d.f. = 9. For a two-sided (two-tailed) hypothesis test, a t value of 1.833 or larger in absolute value would be found by random chance 1/10 of the time, a value of 2.262 or larger would be found 1/20 of the time, and a value of 3.250 or more only 1/100 of the time. If we were to calculate a t value and find it was 2.3, then either the sample is an unusual one (occurring only about 5 percent of the time by random chance) or the true mean of the population is other than hypothesized. We may thus be prepared to conclude that the null hypothesis is false. In so doing, we would err about once in 20 tests in which the null hypothesis is in fact true, if we always employed a probability level of 0.05.
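The table lookup described above can be sketched in Python. This is only an illustration: the dictionary holds just the Table 7 entries for 9 and 10 d.f., and the names `T_CRITICAL` and `is_significant` are hypothetical, not part of the original report.

```python
# Critical values of t copied from Table 7 (rows for 9 and 10 d.f. only).
# Keys are (d.f., number of tails, probability level).
T_CRITICAL = {
    (9, 2, 0.10): 1.833, (9, 2, 0.05): 2.262, (9, 2, 0.01): 3.250,
    (9, 1, 0.10): 1.383, (9, 1, 0.05): 1.833, (9, 1, 0.01): 2.821,
    (10, 2, 0.10): 1.812, (10, 2, 0.05): 2.228, (10, 2, 0.01): 3.169,
    (10, 1, 0.10): 1.372, (10, 1, 0.05): 1.812, (10, 1, 0.01): 2.764,
}


def is_significant(t_value, df, tails=2, level=0.05):
    """Return True if the calculated t exceeds the tabled critical value.

    A two-tailed test compares |t| with the table; a one-tailed test
    (upper tail assumed here) compares t itself.
    """
    critical = T_CRITICAL[(df, tails, level)]
    observed = abs(t_value) if tails == 2 else t_value
    return observed > critical


# The t value of 2.3 discussed in the text, with 9 d.f.
significant_at_05 = is_significant(2.3, 9, tails=2, level=0.05)
significant_at_01 = is_significant(2.3, 9, tails=2, level=0.01)
```

With these entries, a t of 2.3 exceeds the two-tailed 0.05 value (2.262) but not the 0.01 value (3.250), which matches the reasoning in the paragraph above.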

89. The above demonstrates the central theme in understanding tests of hypotheses. Even if the values come from an underlying population in which the actual mean is equal to or greater than 10, it is still possible to observe individual values that are less than 10. It is even possible that the sample mean will be less than 10, although this would be less common.

Table 7

                       Probability Level
             Two Tails                   One Tail
d.f.    0.10     0.05     0.01      0.10     0.05     0.01
  1    6.314   12.706   63.657     3.078    6.314   31.821
  2    2.920    4.303    9.925     1.886    2.920    6.965
  3      .        .        .         .        .        .
  4      .        .        .         .        .        .
  5      .        .        .         .        .        .
  6      .        .        .         .        .        .
  7      .        .        .         .        .        .
  8      .        .        .         .        .        .
  9    1.833    2.262    3.250     1.383    1.833    2.821
 10    1.812    2.228    3.169     1.372    1.812    2.764
 20      .        .        .         .        .        .
 30      .        .        .         .        .        .
  ∞      .        .        .         .        .        .

90.  For  example,  we  now  evaluate  the  t  value  for  the  data  set 
given  above  on  chlorophyll  values,  for  testing  against  the  standard 
value  of  10.0. 

91. The sample mean and variance are calculated first, thus

$$\bar{Y} = \sum Y_i / n = 110.9/10 = 11.09$$

$$\sum Y_i^2 = 2{,}279.61$$

92. The sum of squared deviations is given by

$$\sum y_i^2 = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \left(\sum Y_i\right)^2 / n = 2{,}279.61 - (110.9)^2/10 = 1{,}049.73$$

and the variance by

$$s^2 = \sum y_i^2 / (n - 1) = 1{,}049.73/9 = 116.64$$

The variance of the means is not the same as the population variance. Means are expected to be much less variable than the individual observations in the population. Indeed, the variance of the means of samples of size n equals the population variance divided by n and may be estimated as the sample variance divided by n; thus,

$$s^2_{\bar{Y}} = s^2/n = 116.64/10 = 11.664$$

The standard error, used for the t test, is the square root of the variance of the means. The t statistic is then calculated as

$$t = \frac{\bar{Y} - \mu_0}{s_{\bar{Y}}} = \frac{11.09 - 10}{3.415} = 0.319$$


93. This value is then compared to the value in the table of t values. The critical value selected depends on the degrees of freedom and the probability of error selected. If a probability of error of 0.05 were selected for this one-tailed example, then with 9 d.f. the critical value would be 1.833 (one-tailed). The test statistic above (0.319) is smaller than this, so it does not fall outside the range of normal variation. The hypothesis that these values could have come from a population with a mean less than or equal to 10.0 could not be rejected on the basis of the available data.
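The chlorophyll calculation can be reproduced from the two sums quoted in the text. The following Python sketch is illustrative only; the function name `one_sample_t` is an assumption, and the critical value is the one-tailed Table 7 entry for 9 d.f.

```python
import math


def one_sample_t(sum_y, sum_y2, n, mu0):
    """Return (mean, variance, standard error, t) from the raw sums."""
    mean = sum_y / n
    ss = sum_y2 - sum_y ** 2 / n          # sum of squared deviations
    variance = ss / (n - 1)               # sample variance, s^2
    std_error = math.sqrt(variance / n)   # standard error of the mean
    t = (mean - mu0) / std_error
    return mean, variance, std_error, t


# Sums from the text: sum(Y) = 110.9, sum(Y^2) = 2,279.61, n = 10
mean, variance, std_error, t = one_sample_t(110.9, 2279.61, 10, 10.0)

critical = 1.833          # Table 7: one tail, 0.05 level, 9 d.f.
reject_null = t > critical
print(f"mean={mean:.2f}  s2={variance:.2f}  se={std_error:.3f}  t={t:.3f}")
```

Because t falls well below the critical value, `reject_null` is false, in agreement with the conclusion drawn in the text.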

94. There is one other point that should be made about hypothesis testing. In doing the test above, the probability of error that was set was for one type of error, called Type I error, or α (alpha) error. This error probability is the "probability of rejecting a null hypothesis which is true." In the example above, the null hypothesis was $\mu \le 10.0$, which indicates compliance. Therefore, the probability of erroneously rejecting this hypothesis and concluding that the samples indicated noncompliance was required to be at most 0.05 or 5 percent.

95. There is, however, another type of error, called a Type II or β (beta) error. This is the probability of accepting a null hypothesis which is false. In the above example, since the null hypothesis was not rejected, there is a possibility of a β error. Unfortunately, the probability of a β error depends on the magnitude of the deviation from the null hypothesis, which is unknown. Nevertheless, the t test will, if the assumptions are met, minimize the probability of a β-type error. Therefore, for the data available and under the assumptions previously stated, the t test is considered a "powerful" test because it does have the smallest probability of β error.

Confidence  Intervals 

96. It has been pointed out that the sample mean can be regarded as having a distribution although usually we will have only one observation, i.e., one sample mean, from that distribution. Further, the standard error plays the same role with respect to the distribution of the sample mean as the standard deviation does to the individual population values. Recall that in the case of a normal distribution, 95 percent of the population values are contained within an interval approximately two standard deviations above and below the population mean (the actual value is 1.96 standard deviation units). We might, therefore, expect to be able to determine an interval about the sample mean, as a multiple of the standard error, which would contain 95 percent of the possible sample means, which are distributed about the population mean. Looked at in another way, this is equivalent to determining an interval that, with high probability, say 0.95, would contain the population mean. We refer to this as a 95-percent confidence interval for the mean; it is a measure of the precision of the sample mean.

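97. Because we are dealing with the sample, we cannot determine such an interval as simply as with the population, given normality of distribution. Recall, however, that the frequency distribution of $t = (\bar{Y} - \mu)/s_{\bar{Y}}$ is known, where $\mu$ is here the actual, but unknown, population mean. We can thus readily determine, usually by reference to tables, a value $t_0$ such that

$$P(t \ge t_0) = \alpha/2$$

where $\alpha$ is a specified, usually small value, say 0.05.

98. That is

$$P\left(\frac{\bar{Y} - \mu}{s_{\bar{Y}}} \ge t_0\right) = \alpha/2$$

99. On rearranging, this is equivalent to

$$P(\bar{Y} - t_0 s_{\bar{Y}} \ge \mu) = \alpha/2$$

100. Likewise we can determine $t_1$ such that

$$P(t \le t_1) = \alpha/2$$

101. Indeed $t_1 = -t_0$, i.e.,

$$P\left(\frac{\bar{Y} - \mu}{s_{\bar{Y}}} \le -t_0\right) = \alpha/2$$

which on rearrangement gives

$$P(\bar{Y} + t_0 s_{\bar{Y}} \le \mu) = \alpha/2$$

That is,

$$P(\mu \le \bar{Y} - t_0 s_{\bar{Y}}) = \alpha/2$$

$$P(\mu \ge \bar{Y} + t_0 s_{\bar{Y}}) = \alpha/2$$

so that

$$P(\bar{Y} - t_0 s_{\bar{Y}} \le \mu \le \bar{Y} + t_0 s_{\bar{Y}}) = 1 - (\alpha/2 + \alpha/2) = 1 - \alpha = 0.95$$

with $\alpha = 0.05$.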

102. Thus, $\bar{Y} \pm t_0 s_{\bar{Y}}$ provides a confidence interval for the mean. Loosely, we say that the probability that the population mean is contained within this interval is $1 - \alpha$ (= 95 percent if $\alpha = 0.05$). What we are really saying, however, is that intervals computed this way will, on the average, cover the population mean 95 times out of 100. The probability statement is about the interval and not the population mean, which is a fixed, but unknown, quantity. Note that $t = (\bar{Y} - \mu)/s_{\bar{Y}}$ depends only on the sample and the unknown parameter; such a statistic is called a "pivotal" quantity.
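The "95 times out of 100" interpretation can be checked by simulation. The Python sketch below is an illustration under stated assumptions: the population mean (10.0) and standard deviation (3.0) are hypothetical values chosen only for the demonstration, the samples are drawn from a normal population, and 2.262 is the two-tailed Table 7 value for 9 d.f.

```python
import math
import random
import statistics


def coverage(true_mean=10.0, true_sd=3.0, n=10, t0=2.262,
             trials=2000, seed=1):
    """Fraction of simulated 95-percent intervals that contain the true mean."""
    rng = random.Random(seed)             # fixed seed for repeatability
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        std_error = statistics.stdev(sample) / math.sqrt(n)
        # Interval is (mean - t0 * se, mean + t0 * se)
        if mean - t0 * std_error <= true_mean <= mean + t0 * std_error:
            hits += 1
    return hits / trials


rate = coverage()
print(f"proportion of intervals covering the true mean: {rate:.3f}")
```

The proportion should be close to 0.95. Each individual interval either covers the fixed population mean or it does not; the 95 percent refers to the long-run behavior of the procedure.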

103. The calculations will be demonstrated using the previous values for the chlorophyll concentration example. Degrees of freedom are the same as previously, i.e., d.f. = 9. However, since confidence intervals are usually placed symmetrically above and below the mean value, a two-sided t value is needed for the 9 d.f. (t = 2.262). The calculations proceed as follows:

$$\bar{Y} = 11.09$$

$$s^2_{\bar{Y}} = s^2/n = 11.664$$

$$s_{\bar{Y}} = 3.415$$

Lower 95-percent confidence limit ($L_1$)

$$L_1 = \bar{Y} - t s_{\bar{Y}} = 11.09 - 2.262(3.415) = 3.37$$

Upper 95-percent confidence limit ($L_2$)

$$L_2 = \bar{Y} + t s_{\bar{Y}} = 11.09 + 2.262(3.415) = 18.81$$


104. In the sense indicated above, we may feel 95 percent sure that the true value of $\mu$, the population mean, lies within the interval 3.37 to 18.81. There is, of course, on the average, a 5-percent chance that the true value of $\mu$ falls outside the interval.
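The confidence limits can be computed directly from the summary values. This is a minimal Python sketch; the function name `mean_confidence_interval` is hypothetical, and the inputs are the mean, variance, sample size, and two-sided t value quoted in the text.

```python
import math


def mean_confidence_interval(mean, variance, n, t0):
    """Return (lower, upper) limits: mean -/+ t0 * standard error."""
    std_error = math.sqrt(variance / n)
    half_width = t0 * std_error
    return mean - half_width, mean + half_width


# Chlorophyll example: mean 11.09, s^2 = 116.64, n = 10, t = 2.262 (9 d.f.)
lower, upper = mean_confidence_interval(11.09, 116.64, 10, 2.262)
print(f"95-percent confidence interval: {lower:.2f} to {upper:.2f}")
```

The limits agree with the hand calculation to within rounding; the text rounds the standard error to 3.415 before multiplying, so the last digit may differ by one.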


Two-Sample  Hypotheses 


105. In many situations it is necessary to compare two means rather than simply test a mean against a hypothesized value. Examples include the comparison of two reservoirs, two areas in the same reservoir, or the same area at two different times of the year. Obviously, such comparisons require two samples and lead to two-sample hypotheses. As with one-sample hypotheses, two-sample hypotheses may be two-sided or one-sided.

Two-sided:

$$H_0: \mu_1 = \mu_2 \quad \text{or} \quad H_0: \mu_1 - \mu_2 = 0$$

$$H_A: \mu_1 \ne \mu_2 \quad \text{or} \quad H_A: \mu_1 - \mu_2 \ne 0$$

One-sided:

$$H_0: \mu_1 \le \mu_2 \quad \text{or} \quad H_0: \mu_1 - \mu_2 \le 0$$

$$H_A: \mu_1 > \mu_2 \quad \text{or} \quad H_A: \mu_1 - \mu_2 > 0$$

These hypotheses can be tested using the t test. The t test, as before, requires an assumption of randomly selected samples from normal populations and, strictly speaking, also requires an assumption of homogeneity of variance (i.e., variances of the two populations are equal).

106.  The  F  test  is  used  to  test  the  equality  of  two  population 
variances.  It  also  arises  as  a  test  for  differences  between  two  or  more 
means  in  an  analysis  of  variance  (see  below).  If  both  populations  are 
normally  distributed  with  equal  variances,  the  distribution  of  the  ratio 
of  the  sample  variances  is  known;  it  is  called  the  F  distribution. 

107.  Unlike  the  normal  and  t  distributions,  the  F  distribution  is 
asymmetric.  Since  two  variances  are  estimated,  there  will  be  degrees  of 
freedom  for  the  numerator  and  denominator  of  the  ratio,  which  will  not 
necessarily  be  the  same. 

108. Specifically,

$$F = s_1^2 / s_2^2$$

where

$s_1^2$ = one variance estimate with DF = $n_1 - 1$

$s_2^2$ = second variance estimate with DF = $n_2 - 1$

109. The F test most commonly tests the hypotheses

$$H_0: \sigma_1^2 = \sigma_2^2 \qquad H_A: \sigma_1^2 \ne \sigma_2^2$$

The two-tailed F test is commonly calculated with the larger variance in the numerator. This is because tables of the F statistic are often used in conjunction with the analysis of variance, in which, perforce, the test is one-tailed. Given this convention, the F statistic becomes larger and larger as the two variances become more different. Therefore, a significantly large F value indicates the inequality of variances, whereas a small F value is consistent with the equality of variances.


Given the assumption of equal variances ($\sigma_1^2 = \sigma_2^2 = \sigma^2$) we can write

$$\sigma^2_{\bar{Y}_1 - \bar{Y}_2} = \sigma^2/n_1 + \sigma^2/n_2$$

113. In order to estimate $\sigma^2_{\bar{Y}_1 - \bar{Y}_2}$ it is necessary to estimate $\sigma^2$. Given the assumption of equal variances, both samples can be used to calculate a pooled variance ($s_p^2$) to estimate $\sigma^2$.

$$s_p^2 = (DF_1 s_1^2 + DF_2 s_2^2)/(DF_1 + DF_2)$$

where

$DF_1$ = degrees of freedom, sample 1

$s_1^2$ = variance of sample 1

$DF_2$ = degrees of freedom, sample 2

$s_2^2$ = variance of sample 2

114. The pooled variance is then used to calculate the variance of the difference between the means

$$s^2_{\bar{Y}_1 - \bar{Y}_2} = s_p^2/n_1 + s_p^2/n_2$$

The standard error of the difference between the means is the square root of the variance estimate

$$s_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{s^2_{\bar{Y}_1 - \bar{Y}_2}}$$

and has degrees of freedom equal to the sum of the degrees of freedom of the two samples

$$DF = DF_1 + DF_2$$

115. The calculated t statistic

$$t = (\bar{Y}_1 - \bar{Y}_2)/s_{\bar{Y}_1 - \bar{Y}_2}$$

is then compared to the critical value of t selected from the t table. A sample t statistic larger in absolute value than the critical value would lead to the rejection of the null hypothesis of equality of means.

116. As an example of the use of the two-sample F and t tests, suppose that phosphorus concentrations have been measured at random from two areas in the same reservoir, and the sample means and variances calculated. These sample estimates are given below.

                           Area A    Area B
Number of samples (n)         7        11
Mean, µg/ml ($\bar{Y}$)
Variance ($s^2$)             45        20


117. Prior to sampling we have no basis for a hypothesis about which area has higher concentrations or more variation. Therefore, a two-tailed test will be employed for both the F test and the t test. The hypotheses are

$$H_0: \sigma_A^2 = \sigma_B^2 \qquad H_A: \sigma_A^2 \ne \sigma_B^2$$

$$H_0: \mu_A = \mu_B \qquad H_A: \mu_A \ne \mu_B$$


118. The F statistic is

$$F = s_A^2/s_B^2 = 45/20 = 2.25$$

For a 5-percent level test, the critical value from the table with degrees of freedom 6 (numerator) and 10 (denominator) is 4.07. (One must check whether the table is designed for one- or two-tailed tests.) This value was not exceeded, so we conclude that 2.25 is not an unusually large F value and that the assumption of equal variances can be accepted.

119. Since the assumption of equal variances is acceptable, the two estimates may be combined into a single estimate of the pooled variance, given by

$$s_p^2 = (DF_A s_A^2 + DF_B s_B^2)/(DF_A + DF_B) = [6(45) + 10(20)]/16 = 470/16 = 29.375$$

This estimate of the combined variance also has degrees of freedom equal to the sum of the degrees of freedom for the two components (6 + 10 = 16). It can then be used to calculate the standard error for the t test of the two means.

$$s^2_{\bar{Y}_A - \bar{Y}_B} = (s_p^2/n_A) + (s_p^2/n_B) = (29.375/7) + (29.375/11) = 6.867$$

$$s_{\bar{Y}_A - \bar{Y}_B} = 2.620$$

120. The two-sample t statistic is then

$$t = (\bar{Y}_A - \bar{Y}_B)/s_{\bar{Y}_A - \bar{Y}_B} = (\bar{Y}_A - \bar{Y}_B)/2.620$$

This value would be compared to the tabular t value for 16 d.f., which is 2.120 for a probability of Type I error of $\alpha = 0.05$. The assumption of equal phosphorus concentrations would thus be rejected by a 5-percent level test.
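The variance ratio, pooled variance, and standard error of the difference can be reproduced from the sample sizes and variances used in the phosphorus example. The Python sketch below is illustrative; the function name `pooled_two_sample` is an assumption. Because the sample means are not used, the sketch stops at the smallest difference between means that a 5-percent two-tailed test would declare significant.

```python
import math


def pooled_two_sample(n_a, var_a, n_b, var_b):
    """Return (F ratio, pooled variance, SE of the difference, pooled d.f.)."""
    df_a, df_b = n_a - 1, n_b - 1
    # Larger variance in the numerator, following the convention in the text
    f_ratio = max(var_a, var_b) / min(var_a, var_b)
    pooled = (df_a * var_a + df_b * var_b) / (df_a + df_b)
    std_error = math.sqrt(pooled / n_a + pooled / n_b)
    return f_ratio, pooled, std_error, df_a + df_b


# Phosphorus example: n_A = 7, s_A^2 = 45; n_B = 11, s_B^2 = 20
f_ratio, pooled, std_error, df = pooled_two_sample(7, 45.0, 11, 20.0)

t_critical = 2.120                        # tabled two-tailed value, 16 d.f.
min_difference = t_critical * std_error   # smallest significant |mean_A - mean_B|
print(f"F={f_ratio:.2f}  pooled={pooled:.3f}  se={std_error:.3f}  df={df}")
```

Under these values, any absolute difference between the two sample means larger than about 5.6 µg/ml would lead to rejection of the hypothesis of equal means.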

121. One problem with the F test is that it is much more dependent on the normality assumption than the t test. Rejection of the null hypothesis (equal variances) may be due to a violation of this assumption rather than a difference in the population variances.

Regression 

122. Regression is a type of statistical analysis that is used to express and quantify the relationship between two variables. Its application usually requires the estimation of two or more parameters of a target population. Regression analysis also involves hypothesis testing. Any time a regression is performed, there is an implicit hypothesis that the slope is not zero. A zero value for the slope implies that there is no relationship between the variables.

Linear  regression 

123.  In  a  quantitative  analysis  of  data,  relationships  are  often 
observed  between  two  variables.  One  objective  of  statistical  analysis 
is  to  express  these  relationships  quantitatively  and,  often,  to  test  the 
magnitude  of  the  relationship,  if  any,  between  the  two  variables. 

124.  The  population  regression  line  follows  the  average  of  one 
variable  (Y)  at  unique  values  of  the  other  (X) .  In  some  cases  the 
relationship  is  obvious  and  direct.  For  example,  the  salinity  of  water 
is  often  measured  as  conductivity  because  increased  salinity  results  in 
increased  conductivity.  In  other  cases  the  relationship  is  not  as 
obvious.  There  are  many  relationships  between  water  quality  variables 
that  can  be  described  quantitatively.  The  statistical  method  used  to 
quantify  these  relationships  is  called  regression  analysis. 

125. It should be noted that the demonstration of a relationship does not in any way imply "cause and effect." A strong relationship may be demonstrated simply because both variables relate to a third variable which may not have been included in the analysis.

126. Figure 11 demonstrates the relationship measured between two variables. Pairs of values (X and Y) are measured as sample values. There is an obvious tendency for Y to increase as X increases, indicating that there is some kind of quantitative relationship between the two variables.


Figure  11.  Bivariate  scatter  plot  of  X  and  Y 

127. There are two aspects of this line expressing a relationship between the variables X and Y which can be used to describe the relationship quantitatively. The first is the angle that the line makes with the X axis, that is, the slope of the line. As one passes from X = 0 to X = 1, say, there is an increase in Y. Since the line is straight, for any increase in X of one unit, there is a constant increase in Y. This "increase in Y per unit increase in X" provides a measure of the slope.

128. The second aspect of the line needed to quantify it is some measure of its level. Knowing only the slope, there are an infinite number of possible lines which could be drawn, each parallel to the other. If we can define one point that the line must pass through, there is then only one line that will satisfy the conditions. This point is usually defined as the value of Y when X = 0, or the point at which the line crosses the Y axis. This point is known as the intercept.

129. The simplest linear regression model is represented by the equation

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

where

$Y_i$ = an individual observation of variable Y

$X_i$ = a value of variable X

$\alpha$ = a population parameter for the intercept of the regression line between X and Y

$\beta$ = a population parameter for the slope of the regression line

$\varepsilon_i$ = a random variate describing the deviations of the observed points from the line

130. The method usually used to find estimates for the population parameters for a linear relationship is least squares regression. It consists of finding a line that minimizes the sum of the squares of the vertical distances between the points and the line. The bars connecting the points to the line in Figure 12 indicate the distances whose sum of squares would be minimized in a least squares regression estimate.

131. Some assumptions are made whenever a least squares regression line is fit to a data set. First, we must assume that the relationship that is used is appropriate. If a straight line is used, there is an assumption that the relationship is linear. Other options exist and are discussed later. Several additional assumptions are given below.

a. The units in the sample with any one value of X are randomly chosen from all units in the target population with that value of X. This can be achieved by selecting the units at random from the entire target population.

b. The values of Y at any particular value of X are normally distributed about the regression line.

c. The variance of the Y values is homogeneous, that is, the variance of Y is the same at each value of X.

d. The differences between the points and the line, i.e., the $\varepsilon_i$, are independent; this can be achieved by independent, random selection of the units. Multiple observations in a single unit will not, in general, be independent.

e. All of the variation, due to sampling or other causes, occurs in the Y variable; the X variable is measured without error.


Figure 12. Representation of the distances to be minimized by least squares regression

132. The first assumption is made to ensure that the values chosen are representative of the target population. Such an assumption is appropriate for all statistical analyses. The second assumption is necessary only if hypothesis tests or confidence intervals are required. The statistics employed will then have t and F distributions. The third and fourth assumptions are necessary for the parameter estimates to be "best," i.e., to be the most precise possible. Sometimes, it is observed that the Y values are more widely scattered as the X value increases. In this case the assumption of homogeneous variance is violated. The difficulty may be overcome by using one of the transformations discussed later.

133. It should be noted that the least squares fit will always yield an unbiased regression line which minimizes the vertical distances, even if the values are not normally distributed, the variances heterogeneous, and the $\varepsilon_i$ not independent. Therefore, the estimated values may be useful. The hypothesis tests and confidence intervals would not be correct, however, if the normality assumption is not met.

134. The fifth assumption is necessary because the linear regression techniques used minimize the vertical distance. No consideration is made for variation in X. If the assumption is not met, the estimates will be biased, although the bias may be negligible if the measurement errors are small in relation to the standard deviation of the X values employed. Statistical techniques do exist which can address the problem, but they are not as easy to use and are not usually applied.

135. As one might gather from the above, regression analysis is generally considered to be robust against violation of several assumptions. This means that adequate results can often be obtained even when minor violations of the assumptions occur.

Fitting  simple  linear  relationships 

136. Fitting the regression line requires that estimates of population parameters be obtained from a sample. The values obtained, of course, are not $\alpha$ and $\beta$ (the population parameters), but rather a and b, the sample estimates of the population parameters. The resulting equation is then

$$Y_i = a + bX_i + e_i$$

where a is an estimate of $\alpha$, b an estimate of $\beta$, and each $e_i$ an estimate of the corresponding $\varepsilon_i$.

137. Six intermediate values must be obtained from the sample data to fit a regression relationship. These are:

n: the sample size

$\sum X_i$: the sum of the values of the X variable

$\sum Y_i$: the sum of the Y values

$\sum Y_i^2$: the sum of the squares of the Y variable

$\sum X_i^2$: the sum of squares of the X variable

$\sum X_i Y_i$: the sum of the products of the X and Y variables

The slope of the fitted line is calculated as:

$$b = \sum xy / \sum x^2$$

The lowercase representations used for $\sum x^2$ and $\sum xy$ signify that these values have been adjusted about the means, i.e., $x = X - \bar{X}$, $y = Y - \bar{Y}$.

Thus,

$$\sum x^2 = \sum (X - \bar{X})^2 = \sum X^2 - \left(\sum X\right)^2 / n$$

$$\sum xy = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \sum X \sum Y / n$$

We shall later require

$$\sum y^2 = \sum (Y - \bar{Y})^2 = \sum Y^2 - \left(\sum Y\right)^2 / n$$


138. The estimated line will pass through the sample mean $(\bar{X}, \bar{Y})$. Failure to make the adjustment will force the regression line through the origin.
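The effect of the adjustment about the means can be seen numerically. The Python sketch below is an illustration only; it borrows the intermediate sums from the precipitation and discharge example worked later in this section (n = 7, ΣX = 15, ΣY = 245.8, ΣX² = 58.32, ΣXY = 747.18), and the function name `slopes` is hypothetical.

```python
def slopes(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (slope from adjusted sums, slope from unadjusted sums)."""
    # Sums of squares and products adjusted about the means
    adj_x2 = sum_x2 - sum_x ** 2 / n
    adj_xy = sum_xy - sum_x * sum_y / n
    b_adjusted = adj_xy / adj_x2      # line passes through (mean X, mean Y)
    b_origin = sum_xy / sum_x2        # unadjusted: line forced through (0, 0)
    return b_adjusted, b_origin


b_adjusted, b_origin = slopes(7, 15.0, 245.8, 58.32, 747.18)
print(f"adjusted slope = {b_adjusted:.3f}, through-origin slope = {b_origin:.3f}")
```

The unadjusted ratio gives a noticeably steeper slope, because a line forced through the origin must ignore the nonzero discharge observed at zero precipitation.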

139. Once the slope is calculated, the intercept can be estimated. This is obtained as

$$a = \bar{Y} - b\bar{X}$$

The error sum of squares is

$$SS_{\text{error}} = \sum (Y_i - \hat{Y}_i)^2 = \sum y^2 - b\sum xy$$

The total variability in Y is the sum of the variability accounted for by the regression and the error variability, or

$$SS_{\text{corr. total}} = SS_{\text{model}} + SS_{\text{error}}$$

Therefore, given the corrected total and the model sum of squares, the error sum of squares can be calculated by difference,

$$SS_{\text{error}} = SS_{\text{corr. total}} - SS_{\text{model}}$$

The $SS_{\text{error}}$ expresses the variability due to differences between the sampled values of Y ($Y_i$) and the $\hat{Y}_i$ values generated from the regression equation.

143. The proportion of the total variation in the dependent variable that is accounted for by the regression is called the coefficient of determination ($r^2$), where

$$r^2 = SS_{\text{model}} / SS_{\text{corr. total}}$$

Values of $r^2$ can range from 0 (no relationship between Y and X) to 1 (every value of $Y_i$ lies on the regression line, $\hat{Y}_i = Y_i$).

144. The model and error sum of squares also provide the basis for a test of the significance of a regression. The test is based on the null hypothesis that the slope is zero, or that no relationship exists between Y and X. The hypotheses are

$$H_0: \beta = 0 \qquad H_A: \beta \ne 0$$

145. An F test is used to determine if the calculated slope is significantly different from zero. In order to perform this test it is necessary to calculate the model and error mean squares. Mean squares (short for mean squared deviations) are calculated by dividing the model and error sum of squares by their respective degrees of freedom

$$MS_{\text{model}} = SS_{\text{model}} / DF_{\text{model}}$$

$$MS_{\text{error}} = SS_{\text{error}} / DF_{\text{error}}$$

where

$$DF_{\text{error}} = DF_{\text{total}} - DF_{\text{model}} = (n - 1) - 1 = n - 2$$

Note: $DF_{\text{model}}$ for a linear regression will always be 1.

146. The F statistic used to test the null hypothesis is

$$F = MS_{\text{model}} / MS_{\text{error}}$$

and is then compared to the critical F value with numerator DF = $DF_{\text{model}}$ and denominator DF = $DF_{\text{error}}$.

147. A t test can also be used to test the null hypothesis of a zero slope, as well as testing the significance of the intercept. The two-tailed hypotheses are

$$H_0: \beta = 0 \qquad H_A: \beta \ne 0$$

and

$$H_0: \alpha = 0 \qquad H_A: \alpha \ne 0$$


148. The t statistic used to test the slope is

$$t = (b - 0)/s_b = b/s_b$$

where

b = calculated slope

$s_b$ = standard error of b

149. Similarly, the t statistic used to test the intercept is

$$t = (a - 0)/s_a = a/s_a$$

where

a = calculated intercept

$s_a$ = standard error of a

The standard error of b is

$$s_b = \sqrt{MS_{\text{error}} / \sum x^2}$$

and the standard error of a is

$$s_a = \sqrt{MS_{\text{error}} \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum x^2} \right)}$$

These t tests are evaluated in the same manner as for those t tests previously discussed.

150. Use of the t test also allows for the testing of one-sided hypotheses about the slope. Either

$$H_0: \beta \le c \qquad H_A: \beta > c$$

or

$$H_0: \beta \ge c \qquad H_A: \beta < c$$

can be tested using a t statistic. Remember that the F test only allowed for the two-sided hypothesis. The t statistic used for these tests is calculated as

$$t = (b - c)/s_b$$

151. As an example of a regression analysis, suppose that an investigator wishes to quantify the relationship between precipitation in a subwatershed of the reservoir and the mean daily discharge from this area into the reservoir. The precipitation may be measured at one site in the subwatershed and the discharge (in 1,000 m³/day) could be determined from gage data. Sample data are given below.

Precipitation (X)    Discharge (Y)
       0.0               19.3
       0.2               20.5
       1.3               27.4
       1.7               25.7
       2.5               34.1
       3.2               50.4
       6.1               68.4

152. Note that a point has been included which has zero precipitation. This is to illustrate the fact that some discharge is expected even when there is no precipitation.

153. It is obvious from the data that there is some relationship between the precipitation in the basin and discharge into the reservoir. The calculations now proceed as described. The six intermediate values required are

$\sum X = 15$

$\sum Y = 245.8$

$\sum X^2 = 58.32$

$\sum Y^2 = 10{,}585.52$

$\sum XY = 747.18$

$n = 7$


154. The next step is to calculate the adjusted sums of squares and products needed for the calculation of the slope, and the mean values needed to calculate the intercept.

$$\sum x^2 = \sum X^2 - \left(\sum X\right)^2 / n = 58.32 - (15)^2/7 = 26.177$$

$$\sum xy = \sum XY - \left(\sum X \sum Y\right) / n = 747.18 - 15(245.8)/7 = 220.466$$

$$\bar{Y} = \sum Y / n = 245.8/7 = 35.114$$

$$\bar{X} = \sum X / n = 15/7 = 2.143$$

We shall later require

$$\sum y^2 = \sum (Y - \bar{Y})^2 = \sum Y^2 - \left(\sum Y\right)^2 / n = 10{,}585.52 - (245.8)^2/7 = 1{,}954.429$$


155. The final calculations are then made to obtain estimates for
the slope and intercept.

$$b = \sum xy / \sum x^2 = 220.466/26.177 = 8.422$$

$$a = \bar{Y} - b\bar{X} = 35.114 - 8.422(2.143) = 17.067$$
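The hand calculation in paragraphs 153 to 155 can be checked with a short script. The following is a minimal sketch in Python using the sample data of paragraph 151; the function name `fit_line` is illustrative and is not part of the manual.

```python
# Precipitation (X) and discharge (Y) from the sample data in paragraph 151.
X = [0.0, 0.2, 1.3, 1.7, 2.5, 3.2, 6.1]
Y = [19.3, 20.5, 27.4, 25.7, 34.1, 50.4, 68.4]


def fit_line(x, y):
    """Return (a, b), the least squares intercept and slope of Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Adjusted (mean-corrected) sum of squares and sum of products.
    sxx = sum(v * v for v in x) - sum_x ** 2 / n
    sxy = sum(u * v for u, v in zip(x, y)) - sum_x * sum_y / n
    b = sxy / sxx                      # slope
    a = sum_y / n - b * sum_x / n      # intercept = Ybar - b * Xbar
    return a, b


a, b = fit_line(X, Y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

Rounded to three decimals, the script should reproduce the slope and intercept obtained by hand; small differences in later digits arise because the hand calculation rounds the intermediate values.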


156. The resulting regression line is shown in Figure 13. The
slope is positive, indicating that, as would be expected, Y (the
discharge) increases as X (the precipitation) increases.

157. The sums of squares are

$$SS_{corr.\ total} = \sum y^2 = 1,954.429$$

$$SS_{model} = b \sum xy = (8.422)(220.466) = 1,856.765$$

$$SS_{error} = SS_{corr.\ total} - SS_{model} = 1,954.429 - 1,856.765 = 97.664$$
Figure 13. Linear regression of discharge and precipitation

Therefore, the coefficient of determination is

$$r^2 = SS_{model}/SS_{corr.\ total} = 1,856.765/1,954.429 = 0.950$$

The value of $r^2$ indicates that 0.95 or 95 percent of the variation in
the dependent variable is accounted for by the regression.

158. The F statistic used to test the null hypothesis that the
slope is zero is

$$F = MS_{model}/MS_{error}$$

where

$$MS_{model} = SS_{model}/DF_{model} = 1,856.765/1 = 1,856.765$$

$$MS_{error} = SS_{error}/DF_{error} = 97.664/5 = 19.533$$

so that

$$F = 1,856.765/19.533 = 95.059$$

The critical value of F, with $\alpha = 0.05$, numerator DF = 1, and
denominator DF = 5, is 6.61. The calculated F statistic is much
larger than the critical value and, as a result, the null hypothesis can
be rejected.
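The sums of squares, coefficient of determination, and F statistic of paragraphs 157 and 158 follow directly from the adjusted sums. A minimal sketch in Python, using the same sample data (variable names are illustrative):

```python
# Precipitation (X) and discharge (Y) from the sample data in paragraph 151.
X = [0.0, 0.2, 1.3, 1.7, 2.5, 3.2, 6.1]
Y = [19.3, 20.5, 27.4, 25.7, 34.1, 50.4, 68.4]

n = len(X)
sxx = sum(v * v for v in X) - sum(X) ** 2 / n                  # sum of x^2
sxy = sum(u * v for u, v in zip(X, Y)) - sum(X) * sum(Y) / n   # sum of xy
syy = sum(v * v for v in Y) - sum(Y) ** 2 / n                  # sum of y^2

ss_total = syy                      # corrected total sum of squares
ss_model = sxy ** 2 / sxx           # same as b * sxy, because b = sxy / sxx
ss_error = ss_total - ss_model

df_model, df_error = 1, n - 2
ms_model = ss_model / df_model
ms_error = ss_error / df_error

r_squared = ss_model / ss_total
f_stat = ms_model / ms_error

print(f"SS model = {ss_model:.3f}, SS error = {ss_error:.3f}")
print(f"r^2 = {r_squared:.3f}, F = {f_stat:.2f}")
```

Because the script carries full precision throughout, the error sum of squares and F differ from the hand-calculated values in the second decimal place; the conclusions are unaffected.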

159. The t statistic used to test the same hypothesis ($H_0: \beta = 0$)
is

$$t = (b - 0)/s_b$$

$$= 8.422 / \sqrt{MS_{error}/\sum x^2}$$

$$= 8.422 / \sqrt{19.533/26.177}$$

$$= 9.750$$

The critical value of t, with $\alpha = 0.05$ and DF = 5, is 2.571. As
with the F test, the null hypothesis can be rejected.
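The t statistic can be computed in the same way, and for a simple linear regression its square equals the F statistic. The sketch below assumes the sample data of paragraph 151 and takes the critical value 2.571 from the text rather than from a statistical library; `slope_t` is an illustrative helper name.

```python
import math

# Precipitation (X) and discharge (Y) from the sample data in paragraph 151.
X = [0.0, 0.2, 1.3, 1.7, 2.5, 3.2, 6.1]
Y = [19.3, 20.5, 27.4, 25.7, 34.1, 50.4, 68.4]

n = len(X)
sxx = sum(v * v for v in X) - sum(X) ** 2 / n
sxy = sum(u * v for u, v in zip(X, Y)) - sum(X) * sum(Y) / n
syy = sum(v * v for v in Y) - sum(Y) ** 2 / n

b = sxy / sxx
ms_error = (syy - b * sxy) / (n - 2)
s_b = math.sqrt(ms_error / sxx)      # standard error of the slope


def slope_t(b, c, s_b):
    """t statistic for a hypothesized slope c (c = 0 tests for no relationship)."""
    return (b - c) / s_b


t0 = slope_t(b, 0.0, s_b)
f_stat = (b * sxy) / ms_error        # MS model / MS error with 1 model d.f.

T_CRIT = 2.571                       # two-sided, alpha = 0.05, 5 d.f. (from the text)
reject_null = abs(t0) > T_CRIT

print(f"s_b = {s_b:.4f}, t = {t0:.3f}, t^2 = {t0 ** 2:.2f}, F = {f_stat:.2f}")
print("reject H0: beta = 0" if reject_null else "do not reject H0")
```

Passing a nonzero value of `c` to `slope_t` gives the statistic for the one-sided hypotheses of paragraph 150; the appropriate one-sided critical value must then be taken from a t table.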

160.  Regression  lines  not  only  quantify  a  relationship,  but  also 
allow  for  the  estimation  of  the  average  value  of  Y  at  any  specified 
value  of  X  ,  whether  or  not  the  X  value  was  observed.  For  example,  no 
precipitations  of  5  cm  were  observed,  but  the  regression  relationship 
can  estimate  the  mean  discharge  for  a  precipitation  of  5  cm.  This  would 
be  calculated  as 

$$\hat{Y} = a + bX = 17.067 + 8.422(5) = 59.177$$

161. The regression line can also be used to estimate values
outside the range of observed data. In the example, the greatest
observed precipitation was 6.1 cm. However, the discharge can be
calculated for a greater value, 7 cm, 10 cm, or any other number. For
example, the estimated discharge resulting from 10 cm of rainfall is

$$\hat{Y} = a + bX = 17.067 + 8.422(10) = 101.287$$

Caution  must  be  employed,  however.  There  is  a  tacit  assumption  that  the 
same  linear  relationship  applies  outside  the  range  of  the  data.  This 
may  or  may  not  be  the  case  and,  unfortunately,  cannot  be  tested  unless 
the  range  of  the  data  is  extended. 
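One practical safeguard is to have any prediction routine report when it is extrapolating. The following Python sketch uses the rounded coefficients from paragraph 155 and the observed precipitation range of the sample data; the function name and the returned flag are illustrative choices, not part of the manual.

```python
A, B = 17.067, 8.422          # intercept and slope from paragraph 155
X_MIN, X_MAX = 0.0, 6.1       # range of observed precipitation, cm


def predict_discharge(x0):
    """Return (estimated mean discharge, True if x0 lies outside the data range)."""
    estimate = A + B * x0
    is_extrapolation = not (X_MIN <= x0 <= X_MAX)
    return estimate, is_extrapolation


for x0 in (5.0, 10.0):
    estimate, extrapolated = predict_discharge(x0)
    note = " (extrapolated beyond observed data)" if extrapolated else ""
    print(f"X = {x0:4.1f} cm: estimated discharge = {estimate:.3f}{note}")
```

The flag does not make an extrapolated estimate any more reliable; it only makes the tacit assumption visible to the user.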

162.  It  is  also  important  to  understand  that  the  closer  one  moves 
to  the  extremes  of  the  data,  the  less  precise  are  the  estimates  of  the 
mean  discharge.  The  most  precise  estimate  is  the  one  that  occurs  at  the 
mean of the X values, namely $\bar{Y}$ (since a simple linear regression line
always  passes  through  the  sample  means) .  The  precision  is  even  less 
outside  the  range  of  the  data,  even  if  extrapolation  of  the  line  is 
valid.  Thus,  the  precision  of  the  estimate,  or  the  confidence  in  the 
predictive  ability,  decreases  as  the  distance  from  the  mean  of  the  X 
values  increases.  This  can  be  graphically  illustrated  by  a  confidence 
interval  about  the  regression  line. 

Confidence  intervals  for  regression 

163. The regression line is an estimate of the true situation for
a population, and it can be given a confidence interval. Since the
estimate is a line, the confidence interval is a band about the line.
It has been pointed out that estimation becomes less precise the farther
one moves away from $\bar{X}$. This is reflected in the confidence interval
bands, which are narrowest (closest to the regression line) at the point
where $X = \bar{X}$ and become wider in either direction moving away from
$\bar{X}$ (Figure 14).

164. Confidence intervals for the regression line can be calculated
as an interval at any point on the regression line and are then
connected to give the confidence bands. The equation for the confidence
interval at any value of X, say $X_0$, is

$$\text{Upper limit} = \hat{Y}_0 + t\,s_{\hat{Y}}$$

$$\text{Lower limit} = \hat{Y}_0 - t\,s_{\hat{Y}}$$

where

$\hat{Y}_0 = a + bX_0$ (predicted Y value)
$s_{\hat{Y}}$ = standard error of the predicted Y values
$t$ = critical value of t with degrees of freedom equal to those
of $s_{\hat{Y}}$

The standard error of $\hat{Y}$ at $X_0$ is

$$s_{\hat{Y}} = \sqrt{MS_{error}\left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum x^2}\right]}$$

and has n - 2 degrees of freedom. It should be obvious that the
standard error is at a minimum for $X_0 = \bar{X}$ and will increase as
estimates are made at values of X farther from the mean.

Figure 14. Confidence intervals for the regression of discharge
and precipitation
165. Confidence limits are often expressed as

$$\hat{Y}_0 \pm t\,s_{\hat{Y}}$$

The upper and lower confidence limits for the mean discharge at a
precipitation of 5 cm are

$$59.177 \pm t\,s_{\hat{Y}}$$

$$= 59.177 \pm 2.571(2.980)$$

$$= 59.177 \pm 7.662$$
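The interval at any $X_0$ can be obtained from the data with a few lines of code. The sketch below is a minimal Python illustration that uses the sample data of paragraph 151 and the tabulated critical value 2.571 quoted in paragraph 159; the function name `mean_response_interval` is hypothetical.

```python
import math

# Precipitation (X) and discharge (Y) from the sample data in paragraph 151.
X = [0.0, 0.2, 1.3, 1.7, 2.5, 3.2, 6.1]
Y = [19.3, 20.5, 27.4, 25.7, 34.1, 50.4, 68.4]
T_CRIT = 2.571   # two-sided, alpha = 0.05, n - 2 = 5 d.f. (from the text)


def mean_response_interval(x, y, x0, t_crit):
    """Return (y0, se, lower, upper) for the mean of Y at X = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    sxy = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))
    syy = sum((v - y_bar) ** 2 for v in y)
    b = sxy / sxx
    a = y_bar - b * x_bar
    ms_error = (syy - b * sxy) / (n - 2)
    y0 = a + b * x0
    # Standard error of the predicted mean; smallest when x0 equals x_bar.
    se = math.sqrt(ms_error * (1.0 / n + (x0 - x_bar) ** 2 / sxx))
    return y0, se, y0 - t_crit * se, y0 + t_crit * se


y0, se, lower, upper = mean_response_interval(X, Y, 5.0, T_CRIT)
print(f"Y0 = {y0:.3f}, s = {se:.3f}, limits = ({lower:.3f}, {upper:.3f})")
```

Evaluating the function over a grid of $X_0$ values and joining the limits traces out confidence bands of the kind shown in Figure 14.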


Example  with  computer  output 

168.  Many  applications  of  statistics  employ  a  computer  to  obtain 
the  results  of  the  calculations.  It  will  be  useful  to  examine  typical 
output  in  order  to  see  how  the  values  are  likely  to  be  presented.  The 
computer  program  used  is  the  General  Linear  Models  (GLM)  procedure  from 
the  Statistical  Analysis  System  (SAS).  The  example  is  taken  from  data 
on  the  percent  organic  content  and  percent  clay  of  sediment  samples  from 
Red  Rock  Reservoir  (Gunkel  et  al.  1984).  The  data  used  in  the  study  are 
given  in  Table  8,  and  the  computer  output  of  a  simple  linear  regression 
is  given  in  Table  9. 

169. The computer summary in Table 9 is subdivided into five
sections. The first section gives values for the MODEL, ERROR, and the
CORRECTED TOTAL sums of squares. The MODEL is simply the sum of squares
attributable to the simple linear regression line ($= b\sum xy$) and has a
single degree of freedom. In multiple regressions and other types of
analysis, the MODEL may have many more degrees of freedom. The ERROR is
the sum of squared deviations from the regression line
($= \sum y^2 - b\sum xy$) and provides a measure of random error (i.e.,
$\sum e_i^2$). The CORRECTED TOTAL sum of squares is the total sum of
squares adjusted for the mean ($= \sum y^2$). These statistics provide a
basis for a test of the simple linear regression model. If there were no
linear relationship, i.e., $\beta = 0$, the MEAN SQUARE for the MODEL
would also estimate the variance of the $\varepsilon_i$. Thus, under the
null hypothesis, the MEAN SQUARE for the MODEL should, on average, equal
the MEAN SQUARE for ERROR. The ratio of this MEAN SQUARE for the MODEL
to the MEAN SQUARE for ERROR would then follow an F distribution, if the
$\varepsilon_i$ were normally distributed. The F value is here


Table 8 (Concluded)

Observation   Transect   Station   Sample No.   Percent Clay   Percent Organic Matter

28
29
30
31


Table 9

Example of Computer Output for Simple Linear Regression
(Red Rock Data - Organic Content Regressed on Clay,
Percent - General Linear Models Procedure)

Section 1

DEPENDENT VARIABLE: ORGANIC

SOURCE            DF   SUM OF SQUARES    MEAN SQUARE   F VALUE   PR > F
MODEL              1     113.80279461   113.80279461    128.68   0.0001
ERROR             46      40.68033039     0.88435501
CORRECTED TOTAL   47     154.48312500

Section 2

R-SQUARE        C.V.      ROOT MSE   ORGANIC MEAN
0.736668     13.0047    0.94040151     7.23125000

Section 3

SOURCE     DF      TYPE I SS   F VALUE   PR > F
PCTCLAY     1   113.80279461    128.68   0.0001

Section 4

SOURCE     DF    TYPE III SS   F VALUE   PR > F
PCTCLAY     1   113.80279461    128.68   0.0001

Section 5

                            T FOR H0:                 STD ERROR OF
PARAMETER      ESTIMATE   PARAMETER = 0   PR > |T|      ESTIMATE
INTERCEPT    2.50877086            5.73     0.0001    0.43787001
PCTCLAY      0.10743080           11.34     0.0001    0.00947034

128.68. As described in the section on hypothesis testing, one must
assess whether this is an unlikely value to have arisen by chance alone.
Instead of referring to statistical tables to see if the value would
occur less than once in 20 times ($\alpha = 0.05$) or less than once in 100
times ($\alpha = 0.01$) by chance, the computer program provides an exact
solution (PR > F) and here indicates that there is approximately only
one chance in 10,000 that the value would have occurred by chance alone.
Note that here the test is, perforce, two-tailed since under the
alternative $\beta \ne 0$, the expected value of the MEAN SQUARE for the
MODEL must exceed the MEAN SQUARE for ERROR.

170. The second section of the computer output presents some
simple summary statistics. The $r^2$ value is presented and is obtained by
dividing the MODEL sum of squares by the CORRECTED TOTAL sum of squares
($b\sum xy/\sum y^2$). It indicates that the model accounts for
approximately 74 percent of the total variation. The ROOT MSE is the
square root of the MEAN SQUARE ERROR and is a measure of the variation
about the regression line, analogous to a standard deviation. The mean
of the dependent variable (ORGANIC material) is given, as is the
coefficient of variation (C.V.). The C.V. gives an indication of the
amount of error which exists relative to the mean:
C.V. = (ROOT MSE/MEAN) × 100 percent.
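Each summary statistic in Sections 1 and 2 of Table 9 can be recovered from the sums of squares, degrees of freedom, and mean alone. The following Python sketch does so; the input values are copied from Table 9 and the variable names are illustrative.

```python
import math

# Values copied from Sections 1 and 2 of Table 9.
SS_MODEL, DF_MODEL = 113.80279461, 1
SS_ERROR, DF_ERROR = 40.68033039, 46
ORGANIC_MEAN = 7.23125

ss_total = SS_MODEL + SS_ERROR            # CORRECTED TOTAL sum of squares
ms_model = SS_MODEL / DF_MODEL
ms_error = SS_ERROR / DF_ERROR

f_value = ms_model / ms_error             # F VALUE
r_square = SS_MODEL / ss_total            # R-SQUARE
root_mse = math.sqrt(ms_error)            # ROOT MSE
cv = 100.0 * root_mse / ORGANIC_MEAN      # C.V., in percent

print(f"F = {f_value:.2f}, R-SQUARE = {r_square:.6f}")
print(f"ROOT MSE = {root_mse:.8f}, C.V. = {cv:.4f}")
```

A check of this kind is a quick way to confirm that a table has been transcribed correctly from printed output.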

171. The next two segments of the output of this particular
computer procedure, viz, GLM, give two types of sums of squares (Type I
SS and Type III SS). For simple linear regression, these sums of squares
are equal and provide no new information. They will, however, be
discussed in a subsequent section.

172. The last component of the output gives the parameter
estimates, both the intercept ($b_0$) and the coefficient ($b_1$) of the
independent variable (PCTCLAY). These can then be used to form the
estimated regression equation.

ORGANIC = 2.5088 + 0.1074 (PCTCLAY)

This equation provides a method of estimating mean organic content from
percent clay content of the substrate. The output also provides the
standard error for each of the parameter estimates, t statistics for
testing the hypothesis that each parameter equals zero, and the
probability that the values would be equaled or exceeded by chance.
Observe that the value of t for testing the slope (11.34) is the
square root of the F statistic (128.68), i.e., the tests are equivalent.

Multiple regression

173. The concepts applicable to simple linear regression extend
to multiple regression. The purpose is still to quantify a relationship
between a dependent response variable and some independent variable.
However, in multiple regression there is more than one independent
variable. Since there may exist interrelationships between the
so-called independent variables, the regression coefficients for the
independent variables from a multiple regression are not the same as
would be obtained if the dependent variable were regressed on each
independent variable separately.

174. The formulas for a multiple regression will not be derived
in this manual. For simple two-factor (i.e., two independent variables)
multiple regression, formulas are available in some textbooks, but broad
application of multiple regressions requires matrix algebra. We will
presume that the calculations will be done on a computer and shall
therefore concentrate on the interpretation of computer output.
Computer table and graphic output presented in this chapter were
obtained from SAS procedure GLM (SAS (1981) GRAPH User's Guide; SAS
(1982) User's Guide).

175. An example of multiple regression data is given in Table 10.
These data represent levels of a pollutant measured at various distances
downstream from a source. The pollutant decreases with distance from
the source of pollution, but additional factors are expected to
influence the results of measurements taken on different occasions.
Therefore, two other independent variables have been included in the
multiple regression: temperature and discharge. The model is then

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$$

where $X_1$, $X_2$, and $X_3$ refer to distance, temperature, and
discharge, respectively.

Table 10

Observation   Pollutant   Distance, m   Temperature, °C   Discharge, m³/sec

     1           15.5        1,000            24                 0.8
     2           12.9        1,000            20                 1.0
     3           14.8        1,000            19                 1.3
     4           10.3        1,000            25                 1.8
     5           10.7        1,000            15                 2.0
     6           14.9        2,000            17                 0.5
     7            6.6        2,000            20                 1.0
     8            9.5        3,000            21                 1.0
     9            5.1        3,000            15                 2.0
    10            7.4        4,000            15                 0.5
    11           11.9        4,000            24                 0.8
    12            5.4        4,000            21                 1.0

176. Examination of the computer output (Table 11) will illustrate
several concepts. The output first indicates that the analysis
was performed on a dependent variable called POLLUTNT. The sources of
variation are listed as MODEL and ERROR, which will sum to the CORRECTED
TOTAL ($= \sum y^2$) as before. Note that the model now has 3 d.f., one
for each of the independent variables. Since there were 12 observations,
the CORRECTED TOTAL carries 11 d.f., one being lost when the variables
are adjusted about their means. Three degrees of freedom were required
for the three variables in the model, leaving eight as a measure of
random error. The computer provides calculations of the sum of squares
for each source, and the corresponding mean square. An F test is
calculated for the mean square of the model, using the mean square
error. For the example in Table 11, the F test indicates that the
regression does account for a significant portion of the total
variation. This F test is a joint test of all three of the independent
variables and does not indicate which one(s) contribute significantly to
the description of POLLUTNT.


Table 11

Computer Output for Multiple Regression Example
(General Linear Model Procedure)

DEPENDENT VARIABLE: POLLUTANT

SOURCE            DF   SUM OF SQUARES    MEAN SQUARE   F VALUE   PR > F
MODEL              3      97.73161461    32.57720487      4.84   0.0331
ERROR              8      53.82505206     6.72813151
CORRECTED TOTAL   11     151.55666667

R-SQUARE        C.V.      ROOT MSE   POLLUTANT MEAN
0.644852     24.9011    2.59386420      10.41666667

SOURCE      DF      TYPE I SS   F VALUE   PR > F
DISTANCE     1    54.19739726      8.06   0.0219
TEMP         1     7.19969359      1.07   0.3312
DISCHARG     1    36.33452376      5.40   0.0486

SOURCE      DF    TYPE III SS   F VALUE   PR > F
DISTANCE     1    77.32993941     11.49   0.0095
TEMP         1     1.69303058      0.25   0.6294
DISCHARG     1    36.33452376      5.40   0.0486

                             T FOR H0:                 STD ERROR OF
PARAMETER       ESTIMATE   PARAMETER = 0   PR > |T|      ESTIMATE
INTERCEPT    17.59409833            3.02     0.0166    5.82622109
DISTANCE     -0.00225161           -3.39     0.0095    0.00066415
TEMP          0.11241376            0.50     0.6294    0.22409610
DISCHARG     -3.78579922           -2.32     0.0486    1.62909000

177. That portion accounted for by the model is given by the $r^2$
value, which is the ratio of the MODEL sum of squares
($= \sum_j b_j \sum x_j y$, the $b_j$ being the estimates of the
$\beta_j$) to the CORRECTED TOTAL sum of squares ($\sum y^2$). Thus
64.49 percent of the variation was accounted for by the model. There are
several important points to be made about these calculations. Usual
practice is to adjust about the mean (i.e., to fit an intercept not
necessarily equal to zero), but there are some exceptions. An option in
the computer algorithm will provide results not adjusted about the
means, but the $r^2$ value then does not have the same interpretation.

178.  Following  the  synopsis  of  the  model  are  the  results  of  the 
individual  independent  variables  for  the  regression.  As  before,  two 
types  of  sums  of  squares  are  provided,  but  in  the  case  of  multiple 
regression  they  differ.  They  are  the  "sequential  sums  of  squares"  or, 
in  the  jargon  of  SAS  GLM,  Type  I  SS,  and  partial  sums  of  squares  or 
Type  III  SS. 

179. The sequential sums of squares are commonly, but not
necessarily, larger than the corresponding partial sums of squares;
indeed, Table 11 provides an example of a case where one partial sum of
squares is larger. The sequential sums of squares depend on the order in
which the variables were entered into the computer program. Here
distance was entered first, and on its own accounted for 54.1974 of the
sum of squares carried by the model. This is exactly the same as would
be obtained if a simple linear regression was done for POLLUTNT on
DISTANCE, with no other variables in the model. TEMP was the second
variable entered into the model, so the variable POLLUTNT was already
adjusted for DISTANCE when TEMP was entered. A model containing DISTANCE
and TEMP carries a sum of squares of 61.3971 (not given explicitly) so
that the contribution of TEMP, given that DISTANCE is already in the
model, is 61.3971 - 54.1974 = 7.1997. DISCHARG was last to enter the
model, so the effect of DISTANCE and TEMP jointly was already included.
Accordingly, the sum of squares carried by DISCHARG, given that DISTANCE
and TEMP are already in the model, is 97.7316 - 61.3971 = 36.3345.

180. The partial sums of squares for each variable are obtained
as if all other variables were already included in the model; i.e., they
give the contribution to the total sum of squares carried by each
variable adjusted for the previous inclusion of all other variables
jointly. Thus, the partial sum of squares for DISCHARG is here equal to
the sequential sum of squares since this was the last variable entered.
Clearly, the sequential sums of squares will differ according to the
order that the variables are entered. The partial sums of squares are
independent of this ordering.
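The distinction between the two types of sums of squares can be made concrete by fitting the nested models directly. The sketch below is a minimal Python illustration using only the standard library and the data of Table 10; it is not the algorithm used by SAS GLM. The helper names `solve` and `model_ss` are hypothetical, and the normal equations are solved on mean-adjusted variables.

```python
# Table 10 data: (pollutant, distance m, temperature deg C, discharge m3/sec).
DATA = [
    (15.5, 1000, 24, 0.8), (12.9, 1000, 20, 1.0), (14.8, 1000, 19, 1.3),
    (10.3, 1000, 25, 1.8), (10.7, 1000, 15, 2.0), (14.9, 2000, 17, 0.5),
    (6.6, 2000, 20, 1.0), (9.5, 3000, 21, 1.0), (5.1, 3000, 15, 2.0),
    (7.4, 4000, 15, 0.5), (11.9, 4000, 24, 0.8), (5.4, 4000, 21, 1.0),
]


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def model_ss(y, columns):
    """Return (model sum of squares, coefficients) for y regressed on columns."""
    if not columns:
        return 0.0, []
    n = len(y)
    y_dev = [v - sum(y) / n for v in y]
    devs = [[v - sum(col) / n for v in col] for col in columns]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in devs] for ci in devs]
    rhs = [sum(p * q for p, q in zip(ci, y_dev)) for ci in devs]
    coefs = solve(a, rhs)
    # Model SS = sum over variables of b_j times (sum of x_j * y).
    return sum(c * r for c, r in zip(coefs, rhs)), coefs


y = [row[0] for row in DATA]
variables = {
    "DISTANCE": [row[1] for row in DATA],
    "TEMP": [row[2] for row in DATA],
    "DISCHARG": [row[3] for row in DATA],
}
names = list(variables)

# Sequential (Type I): gain in model SS as each variable enters in order.
sequential, previous = {}, 0.0
for k, name in enumerate(names, start=1):
    ss, _ = model_ss(y, [variables[v] for v in names[:k]])
    sequential[name] = ss - previous
    previous = ss

# Partial (Type III): loss in model SS when one variable is dropped.
full_ss, coefs = model_ss(y, [variables[v] for v in names])
partial = {}
for name in names:
    reduced_ss, _ = model_ss(y, [variables[v] for v in names if v != name])
    partial[name] = full_ss - reduced_ss

for name in names:
    print(f"{name:9s} sequential = {sequential[name]:8.4f}  partial = {partial[name]:8.4f}")
```

Reordering the entries of `variables` changes the sequential values but leaves the partial values unchanged, which is the property described in paragraph 180.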

181. The appropriate sums of squares for testing hypotheses
depend  on  the  circumstances.  These  tests  are  specific  to  the  model  so 
if  one  of  the  variables  were  to  be  omitted,  or  another  variable  added, 
the  values  would  change. 

182. For instance, in the example, the partial sums of squares
show that there is virtually no merit in including TEMP in the model in
addition to DISTANCE and DISCHARG (i.e., the hypothesis $\beta_2 = 0$ can
be accepted). The sequential sums of squares also show that there would
be little or no merit in including TEMP in addition to DISTANCE, but
neither sequential nor partial sums of squares provide a test of adding
DISCHARG to DISTANCE or vice versa. Nor do we see how a model that
includes TEMP and DISCHARG would perform. For these we would have to
fit two-variable models or enter the variables in a different order.

183.  The  final  section  of  output  from  the  computer  contains  the 
parameter  estimates,  or  regression  coefficients.  These  values,  as  with 
simple  linear  regression,  can  be  used  to  estimate  the  mean  pollution, 
given  values  of  distance,  temperature,  and  discharge.  The  equation  for 
this  example  would  be 

POLLUTNT = 17.5941 - 0.00225*DISTANCE + 0.1124*TEMP - 3.7858*DISCHARG

184. Note, however, that the regression coefficient for a variable
is always calculated with adjustments for all other variables in the
model. The parameters would have to be reestimated if TEMP were
excluded.

Curvilinear models

185. All of the regression models demonstrated so far have been
linear, the defining feature being that the response can be written as a
linear combination of the parameters, that is, as the sum of the
parameters each multiplied by known quantities. Models with parameters
that are multiplied, raised to powers, or are powers of known or unknown
quantities are not linear models. Some models with these characteristics
can be linearized by transformation, e.g., the taking of logarithms, and
thus permit parameter estimation by linear least squares methods. There
are also models that involve the inverses of variables, or powers of
variables, or even trigonometric or exponential functions of variables,
and thus appear curvilinear, but since the parameters satisfy the
defining feature of linear models these are still, strictly speaking,
linear models. Thus, for example

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 (1/X_{2i}) + \varepsilon_i$$

$$Y_i = \alpha + \beta_1 \sin X_{1i} + \beta_2 e^{X_{2i}} + \varepsilon_i$$

are linear models.

186. These can be treated as cases of multiple linear regression
with, e.g., $X^*_{1i} = \sin X_{1i}$, $X^*_{2i} = e^{X_{2i}}$. The simplest
cases of this type are, perhaps, polynomials, in which the dependent
variable Y is modeled as the sum of the independent variable X raised to
successively greater powers, e.g.

$$Y_i = \alpha + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \varepsilon_i$$
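The point that such models are linear in the parameters can be demonstrated by transforming the variables first and then applying ordinary least squares. The Python sketch below uses hypothetical, noise-free data generated from known parameter values (2, 3, and -1.5, chosen only for illustration), so the fit should return those values; the helper names `solve` and `least_squares` are not from the manual.

```python
import math


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares(columns, y):
    """Least squares coefficients, one per column, via the normal equations."""
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve(a, rhs)


# Hypothetical data: Y = 2 + 3 sin(X1) - 1.5 exp(X2), with no random error.
x1 = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
x2 = [0.3, 1.1, 0.2, 0.9, 0.0, 1.4, 0.6, 0.8]
y = [2.0 + 3.0 * math.sin(u) - 1.5 * math.exp(v) for u, v in zip(x1, x2)]

# Transformed variables X1* = sin X1 and X2* = exp X2; a column of ones
# carries the intercept.
ones = [1.0] * len(y)
x1_star = [math.sin(u) for u in x1]
x2_star = [math.exp(v) for v in x2]

alpha, beta1, beta2 = least_squares([ones, x1_star, x2_star], y)
print(f"alpha = {alpha:.4f}, beta1 = {beta1:.4f}, beta2 = {beta2:.4f}")
```

A polynomial is handled identically by supplying columns of $X$, $X^2$, $X^3$, and so on, although for high powers the normal equations become poorly conditioned and a more stable solver is advisable.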


187. The family of polynomial curves is described graphically
below. The simplest is

$$Y_i = \alpha + \beta_1 X_i + \varepsilon_i$$

and is just simple linear regression as discussed previously. As shown
in Figures 15a and 15b, simple linear regression may have either a
positive or negative slope. Next in the series is the quadratic

$$Y_i = \alpha + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$$

This will fit a parabolic-shaped curve and may be concave (Figure 15c)
or convex (Figure 15d). Each additional model in the series
incorporates an additional power term one higher than the preceding
(cubic, quartic, etc.). Thus

cubic    $Y_i = \alpha + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \varepsilon_i$

quartic  $Y_i = \alpha + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \beta_4 X_i^4 + \varepsilon_i$

Each additional term allows for a possible change in direction of the
curve. Thus, high-order polynomials can be used to fit rather
complicated patterns of lines along some independent variable.

188. Nevertheless, a polynomial equation reveals very little
about the underlying nature of the relationship it is attempting to
describe. Principally, it provides a very flexible method of fitting
complex curvature and does allow for the detection of pattern and the
demonstration that portions of the variation can be described by some
pattern. An example of the application of a polynomial regression based
on the data in Table 12 is given in the computer output in Table 13. A
plot of the original data and the resultant polynomial is given in
Figure 16.

189. The fitting of a polynomial is one situation in which there
is a logical ordering for the entry of the variable into the model
(linear, quadratic, cubic, etc.), so for testing purposes sequential
sums of squares are usually appropriate.


Table 12

Hypothetical Data for the Polynomial
Regression Example

Observation   Station No.    X

     1            118         1
     2            118         2
     3            118         3
     4            118         4
     5            118         5
     6            118         6
     7            118         7
     8            118         8
     9            118         9
    10            118        10
    11            118        11
    12            118        12
    13           1308         1
    14           1308         2
    15           1308         3
    16           1308         4
    17           1308         5
    18           1308         6
    19           1308         7
    20           1308         8
    21           1308         9
    22           1308        10
    23           1308        11
    24           1308        12


Table 13

Example of Polynomial Regression
General Linear Models Procedure

DEPENDENT VARIABLE: OXYGEN

SOURCE            DF   SUM OF SQUARES    MEAN SQUARE   F VALUE   PR > F
MODEL              3     475.46737152   158.48912384    136.48   0.0001
ERROR             20      23.22596182     1.16129809
CORRECTED TOTAL   23     498.69333333

R-SQUARE        C.V.      ROOT MSE   OXYGEN MEAN
0.953426     17.9108    1.07763542    6.01666667

SOURCE     DF      TYPE I SS   F VALUE   PR > F
X           1   386.56031469    332.87   0.0001
X*X         1     9.14807859      7.88   0.0109
X*X*X       1    79.75897824     68.68   0.0001

SOURCE     DF    TYPE III SS   F VALUE   PR > F
X           1    33.26566326     28.65   0.0001
X*X         1    70.20390107     60.45   0.0001
X*X*X       1    79.75897824     68.68   0.0001

                            T FOR H0:                 STD ERROR OF
PARAMETER      ESTIMATE   PARAMETER = 0   PR > |T|      ESTIMATE
INTERCEPT    4.69343434            3.76     0.0012    1.24670068
X           -4.26674529           -5.35     0.0001    0.79720586
X*X          1.08565046            7.78     0.0001    0.13963081
X*X*X       -0.05867651           -8.29     0.0001    0.00708021


Figure  16.  Polynomial  regression  of  dissolved  oxygen 
concentration  as  a  function  of  time 

190. In the real nonlinear models, the parameters are nonadditive.
These can sometimes be linearized by transformation of the
variables, particularly logarithms. Examples of such models are given
below.

Semilog model

Nonlinear equation    $Y_i = \beta_0 e^{\beta_1 X_i} e^{\varepsilon_i}$

Linearized equation   $\log_e(Y_i) = \log_e(\beta_0) + \beta_1 X_i + \varepsilon_i$

Log-log model

Nonlinear equation    $Y_i = \beta_0 X_i^{\beta_1} e^{\varepsilon_i}$

Linearized equation   $\log_e(Y_i) = \log_e(\beta_0) + \beta_1 \log_e(X_i) + \varepsilon_i$

or                    $Y^*_i = \beta^*_0 + \beta_1 X^*_i + \varepsilon_i$

Note that a multiplicative error is assumed. This has been written as
$e^{\varepsilon_i}$. Since $e^0 = 1$, $\varepsilon_i > 0$ implies that
$e^{\varepsilon_i} > 1$ and $\varepsilon_i < 0$ implies
$0 < e^{\varepsilon_i} < 1$.

191. The semilog model is commonly associated with exponential
growth or decay; it is useful whenever a dependent variable Y is
expected to increase or decrease as a proportion of itself over time or
some other independent variable X. The slope of the line $\beta_1$
provides a measure of the proportional increase (decrease) per unit of X
(e.g., Y may be said to increase by approximately 0.053 or 5.3 percent
per unit of time). It can be used, for example, to describe the
degradation of chemicals over time in aquatic systems.
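Fitting the semilog model amounts to a simple linear regression of $\log_e(Y)$ on X. The Python sketch below uses hypothetical, noise-free decay data generated with $\beta_0 = 12$ and $\beta_1 = -0.053$ (values chosen only for illustration), so the fit should recover them; `fit_line` is an illustrative helper.

```python
import math


def fit_line(x, y):
    """Return (a, b), the least squares intercept and slope of y = a + b x."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx
    return sum(y) / n - b * sum(x) / n, b


# Hypothetical decay data: Y = 12 exp(-0.053 X), with no random error.
x = [0, 1, 2, 4, 8, 16]
y = [12.0 * math.exp(-0.053 * t) for t in x]

# Regress log_e(Y) on X; the intercept estimates log_e(beta_0).
log_b0, b1 = fit_line(x, [math.log(v) for v in y])
b0 = math.exp(log_b0)

# Exact proportional change in Y per unit of X implied by the slope.
change_per_unit = math.exp(b1) - 1.0

print(f"beta_0 = {b0:.3f}, beta_1 = {b1:.4f}")
print(f"proportional change per unit of X = {change_per_unit:.4f}")
```

The last line illustrates why the slope is only an approximate proportional rate: the exact change per unit of X is $e^{\beta_1} - 1$, which is close to $\beta_1$ when the slope is small.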

192. The log-log model is sometimes used to describe nonlinear
relationships where the variables X and Y increase proportionately,
and to calibrate instruments. The equation may be expressed as

$$Y_i / X_i^{\beta_1} = \beta_0 e^{\varepsilon_i}$$

which, apart from the random error, indicates that the ratio of Y to
$X^{\beta_1}$ is a constant ($\beta_0$). If $\beta_1 = 1$, the model
implies that Y is a constant fraction or multiple of X. Note that the
taking of logarithms is still appropriate because of the multiplicative
error; the situation is not equivalent to

$$Y_i = \beta_0 X_i + \varepsilon_i$$

If $\beta_1$ is not equal to unity, the relationship is curved. In such
relationships, if the error is not multiplicative, e.g., if

$$Y_i = \beta_0 e^{\beta_1 X_i} + \varepsilon_i$$

or

$$Y_i = \beta_0 X_i^{\beta_1} + \varepsilon_i$$

the taking of logarithms will no longer linearize the equation and
special nonlinear least squares methods have to be applied. These are
outside the scope of this manual. They are also required by the more
complex models that cannot be linearized by any transformation.
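For the case that can be linearized, the log-log model with multiplicative error, the fit is a simple linear regression of $\log_e(Y)$ on $\log_e(X)$. The Python sketch below uses hypothetical, noise-free data generated with $\beta_0 = 2.5$ and $\beta_1 = 0.75$ (illustrative values only); `fit_line` is an illustrative helper.

```python
import math


def fit_line(x, y):
    """Return (a, b), the least squares intercept and slope of y = a + b x."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx
    return sum(y) / n - b * sum(x) / n, b


# Hypothetical data: Y = 2.5 * X**0.75, with no random error.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [2.5 * v ** 0.75 for v in x]

# Transform both variables, then fit Y* = beta_0* + beta_1 X*.
x_star = [math.log(v) for v in x]
y_star = [math.log(v) for v in y]
b0_star, b1 = fit_line(x_star, y_star)
b0 = math.exp(b0_star)          # back-transform the intercept

print(f"beta_0 = {b0:.3f}, beta_1 = {b1:.3f}")
```

Both X and Y must be strictly positive for the logarithms to exist, and the fitted line minimizes squared error on the logarithmic scale rather than on the original scale.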

Multisource  regression 

193. We are sometimes faced with regression data from more than
one source, for example data collections from each of several lakes. It
may be of interest to know not simply that the relationship is of the
same form for each source but whether the relationships are
quantitatively equivalent. We shall here assume that the relationships
are of the same form, i.e., involve the same structure of the variables,
and examine the second question.

194. Assume that three sets of bivariate (X and Y) data have been
collected and each set of data has been fit to a simple linear
regression. The three regression equations are

$$\hat{Y}_{1i} = a_1 + b_1 X_{1i} \qquad i = 1, 2, \ldots, n_1$$

$$\hat{Y}_{2i} = a_2 + b_2 X_{2i} \qquad i = 1, 2, \ldots, n_2$$

$$\hat{Y}_{3i} = a_3 + b_3 X_{3i} \qquad i = 1, 2, \ldots, n_3$$

Also assume that all slopes ($b_1$, $b_2$, and $b_3$) are significantly
different from zero.

195. The hypotheses to be tested are

$$H_0: \beta_1 = \beta_2 = \beta_3$$

$$H_A: \beta_1 \ne \beta_2 \ne \beta_3$$

(i.e., the slopes are not all equal). The basic calculations necessary
to compare the slopes were computed during the linear regressions for
each data set. For each data set we need

$$\sum x^2 = \sum (X_i - \bar{X})^2 = \sum X^2 - (\sum X)^2/n$$

$$\sum y^2 = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - (\sum Y)^2/n$$

$$\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \sum X \sum Y/n$$

and the error sum of squares and error degrees of freedom

$$SS_{error} = \sum y^2 - (\sum xy)^2/\sum x^2$$

$$DF_{error} = n - 2$$

The values necessary for the comparison of slopes are presented in
Table 14.

196. The values of the three error sums of squares can be added
to provide what may be called a "pooled" error sum of squares

$$SS_p = SS_1 + SS_2 + SS_3$$

with

$$DF_p = DF_1 + DF_2 + DF_3$$

The values of $\sum x^2$, $\sum xy$, and $\sum y^2$ (denoted A, B, and
C, respectively) must also be summed

$$A_c = A_1 + A_2 + A_3$$

$$B_c = B_1 + B_2 + B_3$$

$$C_c = C_1 + C_2 + C_3$$

and from these sums a "common" error sum of squares can be calculated

$$SS_c = C_c - (B_c^2/A_c)$$

These calculations are also presented in Table 14.

197. An F statistic is used to test the null hypothesis that the
three slopes are equal

    F = \frac{(SS_c - SS_p) / (k - 1)}{SS_p / DF_p}

where k = number of slopes being compared (in this example k = 3).
This calculated F statistic would be compared with a critical value of
F having k - 1 numerator degrees of freedom and $DF_p$ denominator
degrees of freedom.
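The slope comparison described in paragraphs 195 through 197 can be sketched in a few lines of Python. This is a minimal illustration rather than the report's own procedure: the function names (`corrected_sums`, `compare_slopes`) and the data are hypothetical, invented only to show how the pooled and common error sums of squares combine into the F statistic. The critical value of F must still be looked up separately.

```python
def corrected_sums(x, y):
    """Return (sum x^2, sum xy, sum y^2), each corrected for the mean."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sum_x ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    syy = sum(v * v for v in y) - sum_y ** 2 / n
    return sxx, sxy, syy


def compare_slopes(datasets):
    """F test for equality of k regression slopes.

    datasets is a list of (x_values, y_values) pairs, one per source.
    Returns (F, numerator DF, denominator DF).
    """
    k = len(datasets)
    ss_p, df_p = 0.0, 0          # pooled error SS and DF
    a_c = b_c = c_c = 0.0        # summed sum x^2, sum xy, sum y^2
    for x, y in datasets:
        sxx, sxy, syy = corrected_sums(x, y)
        ss_p += syy - sxy ** 2 / sxx   # error SS of this regression
        df_p += len(x) - 2             # error DF of this regression
        a_c += sxx
        b_c += sxy
        c_c += syy
    ss_c = c_c - b_c ** 2 / a_c        # "common" error SS
    f_stat = ((ss_c - ss_p) / (k - 1)) / (ss_p / df_p)
    return f_stat, k - 1, df_p


# Hypothetical data, invented purely for illustration.
example = [
    ([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]),
    ([1, 2, 3, 4, 5], [1.0, 3.2, 4.9, 7.1, 9.0]),
    ([2, 4, 6, 8], [3.8, 8.1, 11.9, 16.2]),
]
f_stat, df_num, df_den = compare_slopes(example)
print(f"F = {f_stat:.3f} with {df_num} and {df_den} degrees of freedom")
```

A small F relative to the tabulated critical value would be consistent with a common slope; a large F would indicate that at least one slope differs.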

198. If the calculated F is smaller than the critical value,
the null hypothesis ($H_0: \beta_1 = \beta_2 = \beta_3$) can be accepted.
If, and only if, this hypothesis can be accepted, it is possible to
determine if the intercepts are equal. Given that the slopes are equal,
if the intercepts are all equal the regressions for each of the three
data sets are identical.

199.  The  hypotheses  to  be  tested  are 


    H_0: \alpha_1 = \alpha_2 = \alpha_3

    H_A: \alpha_1 \neq \alpha_2 \neq \alpha_3


Calculations  Required  for  a  Multisource  Regression 


To perform this test it is necessary to combine the three data sets and
calculate $\sum x^2$, $\sum xy$, and $\sum y^2$

    (\sum x^2)_T = \sum X_T^2 - (\sum X_T)^2 / n_T

    (\sum xy)_T = \sum X_T Y_T - \sum X_T \sum Y_T / n_T

    (\sum y^2)_T = \sum Y_T^2 - (\sum Y_T)^2 / n_T

where

    X_T = X value from the set of all Xs
    Y_T = Y value from the set of all Ys
    n_T = n_1 + n_2 + n_3

An error sum of squares can be computed from these values as

    SS_T = (\sum y^2)_T - (\sum xy)_T^2 / (\sum x^2)_T

200. An F statistic is used to test the hypothesis that the three
intercepts are all equal

    F = \frac{(SS_T - SS_c) / (k - 1)}{SS_c / DF_c}

where $SS_c$, $DF_c$, and k are as defined for the F test comparing
slopes. This calculated F statistic would be compared with a critical
value of F having k - 1 numerator degrees of freedom and denominator
degrees of freedom of $DF_c$.
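The intercept comparison can be sketched in the same way. The sketch below rests on two assumptions that should be checked against Table 14: the denominator of F is taken to be the common error mean square, and its degrees of freedom are taken as n_T - k - 1 (the residual degrees of freedom when k separate intercepts and one shared slope are fitted, as in the conventional analysis of covariance). The function names and data are hypothetical.

```python
def corrected_sums(x, y):
    """Return (sum x^2, sum xy, sum y^2), each corrected for the mean."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sum_x ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    syy = sum(v * v for v in y) - sum_y ** 2 / n
    return sxx, sxy, syy


def compare_intercepts(datasets):
    """F test for equal intercepts, given that the slopes are equal.

    datasets is a list of (x_values, y_values) pairs, one per source.
    Returns (F, numerator DF, denominator DF).
    """
    k = len(datasets)
    a_c = b_c = c_c = 0.0
    all_x, all_y = [], []
    for x, y in datasets:
        sxx, sxy, syy = corrected_sums(x, y)
        a_c += sxx
        b_c += sxy
        c_c += syy
        all_x.extend(x)
        all_y.extend(y)
    ss_c = c_c - b_c ** 2 / a_c              # common error SS
    sxx_t, sxy_t, syy_t = corrected_sums(all_x, all_y)
    ss_t = syy_t - sxy_t ** 2 / sxx_t        # error SS, all data combined
    df_c = len(all_x) - k - 1                # assumed DF (see lead-in)
    f_stat = ((ss_t - ss_c) / (k - 1)) / (ss_c / df_c)
    return f_stat, k - 1, df_c


# Hypothetical data: three nearly parallel lines with different intercepts.
parallel = [
    ([1, 2, 3, 4], [2.1, 3.9, 6.1, 7.9]),
    ([1, 2, 3, 4], [7.1, 8.9, 11.1, 12.9]),
    ([1, 2, 3, 4], [12.1, 13.9, 16.1, 17.9]),
]
f_stat, df_num, df_den = compare_intercepts(parallel)
print(f"F = {f_stat:.1f} with {df_num} and {df_den} degrees of freedom")
```

This test is meaningful only after the slope test has failed to reject equality of slopes, as paragraph 198 requires.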

201. These methods extend in a straightforward manner to simple
or multiple regressions or more than three sources. Indeed one may
conclude that the same parameters satisfy some but not all sources.
Likewise some, but not necessarily all, may have the same slope (i.e.,
be parallel) but be distinct (i.e., have different intercepts). Any
combination is possible; usually the objective is to find the most
parsimonious representation that gives a satisfactory fit.

202.  The  case  where  one  tests  for  differences  between  intercepts, 
under  the  assumption  that  all  the  lines  are  parallel,  is  the  conven¬ 
tional  analysis  of  covariance.  There  is  a  tendency  recently,  however, 
to  refer  to  the  whole  multisource  regression  situation  as  the  analysis 
of  covariance. 


Analysis  of  Variance 


203.  As  the  terminology  implies,  an  analysis  of  variance  (ANOVA) 
is  a  procedure  for  quantifying  sources  of  variability  in  a  set  of  data. 
Of  prime  importance  is  the  realization  that  an  ANOVA  does  not  identify 
the  sources  or  components  of  variance  per  se.  The  researcher  specifies 
the  components  and  the  ANOVA  merely  assesses  the  amount  of  variability 
accounted  for  by  the  factors  which  have  been  postulated  to  explain  the 
variability  in  the  data. 

204. The factors or cause-and-effect relationships are specified
explicitly  in  a  model.  However,  model  formulation  is  characteristically 
different  from  many  modeling  endeavors.  A  large  number  of  modeling 
enterprises  are  typified  by  seeking  the  one  model,  from  various  classes 
of  models,  which  best  describes  the  data.  Classic  examples  are  physical 
or  mathematical  models  of  ecosystems.  The  ANOVA  is  based  on  a  single 
class  of  models  called  general  linear  models.  Within  this  class,  model 
construction  is  a  direct  consequence  of  the  sampling  or  experimental 
design  used  to  gather  the  data.  Only  after  model  formulation  does  the 
ANOVA  come  into  play.  The  ANOVA  ascertains  the  degree  to  which  model 
components  and  their  interrelationships  account  for  the  variability 
observed  in  the  data. 

205.  In  summary,  model  formulation  is  dependent  upon  the  ques¬ 
tions  posed  by  the  researcher  and  the  design  of  the  sampling  or  experi¬ 
mental  process  followed  to  obtain  the  data.  The  ANOVA  operates  as  the 
analytical  tool  for  generating  the  answers.  The  questions  asked  may  be 
exploratory or take on the character of a fact-finding mission, but most
often the questions will be posed as "a priori" statements or hypotheses
about  what  one  expects  to  find. 

Two  hypothetical  examples 

206.  Consider  a  study  dealing  with  the  chlorophyll  concentration 
of  surface  water  samples  from  two  reservoirs.  Table  15  presents  the 
data.  The  sampling  design  involved  the  random  selection  of  five  sta¬ 
tions  located  in  the  main  pool  of  each  reservoir,  followed  by  the  draw¬ 
ing  of  three  samples  at  each  station.  The  sampling  design  may  have  been 
a  consequence  of  research  objectives  to  determine  (a)  if  reservoirs  dif¬ 
fer  in  mean  chlorophyll  concentration;  (b)  if  a  significant  degree  of 
heterogeneity  exists  among  stations  and  if  reservoirs  differed  sub¬ 
stantially  in  this  heterogeneity;  and  (c)  if  sampling  variability  within 
stations  was  relatively  high  or  low. 

207.  Based  on  these  objectives  and  the  sampling  design,  the  model 
would  be  defined.  Three  model  components  are  explicitly  required:  a 
component reflecting between-reservoir mean differences, a component
dealing  with  variability  among  stations  within  each  reservoir,  and  a 
component  expressing  sample  heterogeneity  within  stations.  No  other 
components  are  provided  for  by  the  objectives  and  sampling  design.  If 
different  or  additional  objectives  had  to  be  met,  the  sampling  design 
and  model  would  have  to  be  modified  in  order  to  accommodate  the  needs  of 
the  researcher.  An  ANOVA  performed  on  these  data  would  quantify  the 
degree  to  which  model  components  explained  the  variability  inherent  in 
the  data. 
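One way to write the three components named above is as a nested linear model. The notation below is an assumption introduced here for illustration; the subscripts and Greek letters are not taken from the report.

```latex
% Hypothetical notation for the three components named in paragraph 207
X_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{k(ij)}
% \mu                 overall mean chlorophyll concentration
% \alpha_i            effect of reservoir i,                 i = 1, 2
% \beta_{j(i)}        effect of station j within reservoir i, j = 1, ..., 5
% \varepsilon_{k(ij)} sample k within station j of reservoir i, k = 1, 2, 3
```

The parentheses in the subscripts indicate nesting: station 3 of reservoir A has no connection to station 3 of reservoir B.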

208.  Consider  a  second  example,  experimental  rather  than  sam¬ 
pling.  Assume  that  a  researcher  wanted  to  identify  the  limiting  nutri¬ 
ent  that  was  controlling  algal  productivity  during  summer  stratification 
in  the  near-dam  region  of  a  given  reservoir.  Identification  of  the 
limiting  nutrient  was  carried  out  using  the  Algal  Assay  Procedure 
(National  Eutrophication  Research  Program  1971).  This  procedure  uses 
nutrient additions (usually phosphorus, nitrogen, and EDTA
(ethylenediaminetetraacetic acid), separately and in combination) to
filtered water samples to which a test alga has been added. After 14 to
21 days, the standing crop (usually expressed as dry weight/litre or
cells/litre) of the algae is determined. Under controlled laboratory
conditions of light and temperature, the maximum standing crop of the
algae is related to the amount of limiting nutrient initially available.
The results of this experiment are given in Table 16.

Table 15

Chlorophyll Concentration of Surface Water Samples

                            Sample                       Mean
Reservoir  Station     1       2       3     Station  Reservoir  Overall
    A        A1      11.3     9.9    14.1     11.8      20.9      16.0
             A2      26.8    29.0    26.8     27.5
             A3      12.9     8.1    12.4     11.1
             A4      34.7    32.1     --      31.9
             A5      22.1    21.6    22.5     22.1
    B        B1      14.1    15.6    11.3     13.7      11.0
             B2       8.4     9.9     7.9      8.7
             B3       4.7     1.2     3.3      3.1
             B4      12.9    11.3    14.1     12.8
             B5      15.6    18.2    16.7     16.8

209. The experiment involved taking 24 subsamples (eight treatments
× three replicates per treatment) from a single epilimnetic sample
taken  from  the  near-dam  area  of  the  reservoir.  There  was  not  sufficient 
space  in  any  single  environmental  chamber  for  all  of  the  subsamples  to 
be  processed  together,  so  the  investigator  randomly  assigned  the  sub¬ 
samples  to  three  different  environmental  chambers.  The  random  assign¬ 
ment  of  the  subsamples  to  the  chambers  was  necessary  because  identical 
conditions  (i.e.,  light  and  temperature)  between  chambers  could  not  be 
assured.  Within  any  given  chamber,  all  subsamples  are  considered  to  be 
at  identical  conditions  (except  for  the  experiment  treatment  applied). 

210.  Once  the  experiment  is  designed  and  executed,  the  ANOVA 
model is predetermined. For the example presented above, the model
includes only one component, the effect of the nutrient additions. The
ANOVA would provide the information needed to assess that effect. The
null hypothesis of equal standing crop for all treatments can be readily
tested. Casual inspection of the means (Table 16) suggests that
phosphorus is the limiting nutrient and that combinations with nitrogen
and EDTA have little effect on the standing crop.


Table 16

Dry Weights (mg/litre) of the Test Algae After 14 Days of Incubation

                                    Block
Treatment            Chamber 1   Chamber 2   Chamber 3    Mean
CONTROL                  7.6         8.4         7.8       7.9
+ P                     30.5        34.2        32.3      32.3
+ N                      9.6        10.1         9.9       9.9
+ EDTA                   8.4         9.3        10.4       9.4
+ P, + N                33.7        35.8        34.2      34.6
+ P, + EDTA             32.3        33.1        33.5      33.0
+ N, + EDTA             10.9        11.4        11.2      11.2
+ P, + N, + EDTA        34.3        35.0        33.8      34.4

Fundamentals  of  design 

211.  Although  a  complete  discussion  of  sampling  and  experimental 
design  is  beyond  the  scope  of  this  text,  a  discussion  of  some  basic 
principles  and  major  issues  is  worthwhile.  The  major  objective  of  any 
design, sampling or experimental, is to generate the most powerful,
efficient,  and  accurate  results  relative  to  the  research  questions  at 
hand  within  the  constraints  imposed  by  time,  money,  and  manpower. 

212.  All  designs  (sampling  or  experimental)  leading  to  an  ANOVA 
are  characterized  by  a  random  sample  of  units  from  each  of  the  series  of 
treatment-populations.  Henceforth,  the  terminology  "treatment- 
populations"  will  be  used  in  a  generic  sense  to  refer  to  the  populations 
under  consideration  in  sampling  designs  or  the  controlled  manipulations 
employed  by  the  researcher  in  experimental  designs.  Likewise,  "sampling 
units" and "experimental units" will be used to refer to the primary
sample of units from each treatment-population. For instance, our first
hypothetical  example  involved  the  random  selection  of  five  stations  from 
each  reservoir  and  the  subsequent  selection  of  three  water  samples  from 
each  station.  Accordingly,  reservoirs  would  represent  treatment- 
populations,  and  stations  would  be  the  primary  sampling  units.  Water 
samples  would  simply  represent  the  process  of  secondary  sampling  or 
subsampling. 

213.  Our  second  example  is  more  complex,  but  the  complexity  pro¬ 
vides  the  flexibility  to  introduce  additional  concepts  and  issues.  It 
should  be  clear  that  the  treatment-populations  are  the  eight  experi¬ 
mental  manipulations.  The  manipulations  used  were  under  the  control  of 
the  investigator  or,  in  other  words,  the  investigator  chose  a  specific 
experimental  design.  A  design  is  the  plan  followed  in  selecting  units 
from  populations  (i.e.,  a  sampling  design)  or  in  applying  treatments  to 
units  (i.e.,  an  experimental  design). 

214.  In  summary,  all  designs  can  be  characterized  by  a  random 
sample  of  units  from  a  set  of  treatment-populations,  but  specific 
designs  are  differentiated  according  to  the  plan  followed  in  selecting 
sampling  units  from  populations  or  applying  treatments  to  experimental 
units.  Sampling  units  could  be  lakes  (units)  selected  from  different 
regions  (treatment-populations)  of  the  country,  sediment  core  samples 
(units)  drawn  at  different  locations  (treatment-populations)  of  a 
reservoir,  coves  (units)  selected  from  different  reservoirs  (treatment- 
populations),  or  water  samples  (units)  drawn  at  different  times  of  the 
year (treatment-population) and/or at different depths (treatment-
population). Two examples of experimental units are limnetic enclosures
(units)  supplied  with  different  nutrients  (treatments)  to  study  nutrient 
limitation,  or  littoral  stations  (units)  treated  with  different  herbi¬ 
cides  (treatments)  to  determine  effectiveness  for  macrophyte  control. 

215.  Even  though  some  might  claim  this  characterization  of 
designs  to  be  an  oversimplification,  the  fact  remains  that  the  myriad  of 
designs,  models,  and  associated  analyses  of  variance  are  based  on  the 
concept  of  a  random  sample  of  experimental  or  sampling  units  from  a  set 
of treatment-populations. The complexity arises not from the
fundamental process but from (a) the procedures followed in conducting
the entire sampling or experimental plan, (b) the presumed or known
underlying structure of the treatment-populations, (c) the number of factors
that  need  to  be  controlled  so  as  to  lead  to  valid  and  reliable  conclu¬ 
sions,  and  (d)  the  procedures  followed  in  performing  the  analysis.  The 
result  is  virtually  an  infinite  array  of  designs  and  computational 
algorithms  that  are  situation  specific  but  conceal  the  underlying 
simplicity. 

216.  Selection  of  design  cannot  be  separated  from  the  research 
objectives.  The  design  that  best  meets  the  needs  of  the  researcher  must 
be  found.  Sometimes  objectives  vary  over  a  wide  range,  and  no  design 
may  be  available  to  maximize  efficiency  for  all.  Prioritization  is  not 
negative;  certain  objectives  might  be  sacrificed,  entirely  or  partially, 
in  order  to  obtain  a  design  that  is  best  for  those  objectives  of  primary 
concern.  Moreover,  different  designs  may  be  selected  to  satisfy  dif¬ 
ferent  sets  of  objectives.  Fortunately,  all  designs  that  have  been 
developed  and  have  been  documented  are  extensions  of  two  fundamental 
building-block  designs  which  are  the  topic  of  the  next  section. 

The  two  basic  designs 

217.  The  two  studies  discussed  above  exemplify  the  two  designs  on 
which all designs are based. These designs are called the "completely
randomized design" (CRD) and the "randomized block design" (RBD). A CRD
may  be  defined  as  follows:  Given  a  set  of  treatment-populations,  a  sim¬ 
ple  random  sample  of  elements  is  selected  from  each  population  or  a  sim¬ 
ple  random  sample  of  units  is  evaluated  under  each  treatment. 

218. Consider a study designed to compare three reservoirs in
terms of the total phosphorus concentration of the surface water during
a particular time of the year. Table 17 presents the data. Sampling
merely involved taking one water sample from each station of a randomly
selected set of stations. Note that the three reservoirs represent the
treatment-populations while stations represent the sampling units from
each reservoir. The situation typifies a CRD, or a simple random sample
of units from a series of treatment-populations. It is true that more
than one water sample could have been drawn at each station. This added
sampling stage does not define the design. As was the case with the
comparative study on reservoirs (Table 15), the primary stage of
sampling defines the design, and as such the examples of Tables 15 and 17
are cases of a CRD.

Table 17

Samples Drawn at Randomly Selected Set of Stations from Three Reservoirs

                 Reservoir
Station       1      2      3
   1         31     32     22
   2         28     30     21
   3         27     30     23
   4         30     31
   5                34

219.  An  RBD,  the  second  basic  design,  involves  a  two-stage  pro¬ 
cess.  Initially,  the  sampling  or  experimental  units  are  grouped  into 
"blocks."  The  grouping  is  based  on  one  or  more  organismic  or  environ¬ 
mental  characteristics  which  are  presumed  to  affect  or  influence  the 
response  of  a  unit.  The  fundamental  premise  is  that  variability  within 
a  block  is  quite  small  relative  to  the  variability  between  blocks; 
homogeneity  within  versus  heterogeneity  between  is  the  rule.  Subsequent 
to  blocking,  random  sampling  occurs  as  in  a  CRD,  but  the  sampling  pro¬ 
cess  is  implemented  block  by  block. 
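The block-by-block randomization can be illustrated with a short Python sketch. It assumes the simplest case, in which every treatment appears exactly once in every block and only the order (or position) within the block is randomized. The function name `randomize_within_blocks` and the labels are hypothetical.

```python
import random


def randomize_within_blocks(treatments, blocks, seed=None):
    """Return a dict mapping each block to a random ordering of all treatments.

    Every treatment appears once per block; the randomization is carried
    out separately (independently) within each block.
    """
    rng = random.Random(seed)
    plan = {}
    for block in blocks:
        order = list(treatments)   # copy so the input list is left untouched
        rng.shuffle(order)         # independent shuffle for this block
        plan[block] = order
    return plan


# Hypothetical labels, chosen only for illustration.
treatments = ["CONTROL", "+P", "+N", "+EDTA"]
blocks = ["Block 1", "Block 2", "Block 3"]
plan = randomize_within_blocks(treatments, blocks, seed=1)
for block in blocks:
    print(block, plan[block])
```

In a CRD, by contrast, a single randomization would be applied across all units at once, with no restriction that each group of units receive every treatment.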

220.  The  experiment  reported  in  Table  16  followed  an  RBD.  The 
three  environmental  chambers  constitute  the  blocks  and  homogenization 
prior  to  subsampling  ensured  the  similarity  of  initial  nutrient  concen¬ 
trations  within  each  block  (i.e.,  low  within-block  variability).  Vari¬ 
ability  between  blocks  (due  to  differences  in  light  and  temperature 
between  environmental  chambers)  is  controlled  by  the  design. 
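As a rough illustration of how blocking enters the arithmetic, the sketch below partitions the Table 16 dry weights into treatment, block (chamber), and error sums of squares. The report does not give these formulas at this point; the computation follows the usual randomized block ANOVA with one observation per treatment-block cell, and the function name `rbd_anova` is hypothetical.

```python
def rbd_anova(table):
    """Randomized block ANOVA with one observation per cell.

    table[i][j] is the response for treatment i in block j.
    Returns a dict of sums of squares, degrees of freedom, and F ratios.
    """
    t = len(table)            # number of treatments
    b = len(table[0])         # number of blocks
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / (t * b)
    ss_total = sum(v * v for row in table for v in row) - correction
    ss_treat = sum(sum(row) ** 2 for row in table) / b - correction
    block_totals = [sum(row[j] for row in table) for j in range(b)]
    ss_block = sum(total ** 2 for total in block_totals) / t - correction
    ss_error = ss_total - ss_treat - ss_block
    df_treat, df_block = t - 1, b - 1
    df_error = df_treat * df_block
    ms_error = ss_error / df_error
    return {
        "ss_total": ss_total, "ss_treat": ss_treat,
        "ss_block": ss_block, "ss_error": ss_error,
        "df_treat": df_treat, "df_block": df_block, "df_error": df_error,
        "f_treat": (ss_treat / df_treat) / ms_error,
        "f_block": (ss_block / df_block) / ms_error,
    }


# Dry weights (mg/litre) from Table 16; columns are chambers 1, 2, 3.
table16 = [
    [7.6, 8.4, 7.8],      # CONTROL
    [30.5, 34.2, 32.3],   # + P
    [9.6, 10.1, 9.9],     # + N
    [8.4, 9.3, 10.4],     # + EDTA
    [33.7, 35.8, 34.2],   # + P, + N
    [32.3, 33.1, 33.5],   # + P, + EDTA
    [10.9, 11.4, 11.2],   # + N, + EDTA
    [34.3, 35.0, 33.8],   # + P, + N, + EDTA
]
result = rbd_anova(table16)
print(f"F (treatments) = {result['f_treat']:.1f}, "
      f"DF = {result['df_treat']}, {result['df_error']}")
```

Removing the chamber-to-chamber sum of squares from the error term is what the design "controls"; without blocking, that variability would remain in the error mean square.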

221. Randomized block designs have wide applicability. Consider
a study with the objective of assessing water quality as a function of
depth (let us say surface, shallow, deep). Suppose further that several
stations on the reservoir have been established as monitoring points.
Now suppose that during a particular month, several samples are selected
at each depth. If this were the case, then an RBD would result where
stations would constitute the blocks while depths would represent the
treatment-populations.

222.  A  slight  variation  on  the  theme  of  depth  sampling  involves 
time  sampling  and  again  would  yield  an  RBD.  Restricting  assessments  to 
surface  water,  several  samples  might  be  drawn  each  month  for  12  months. 
Stations  would  again  constitute  the  blocks,  but  month  (time  of  sampling) 
would  represent  treatment-populations. 

223. Some care must be exercised in time and depth sampling.
What at first glance appears to be an RBD is oftentimes not. In
addition, a treatment-population factor in one instance may not be in
another. Regarding the first point, consider time sampling of surface
water on a reservoir. If water samples are drawn at distinctly
different stations each month, a CRD results, not an RBD. This is so
because the months constitute the treatment-populations and a different
set of stations, not a replicated set of stations, is sampled each time.

224.  Regarding  the  second  point,  consider  a  comparative  study  on 
two  reservoirs.  Suppose  each  month  a  different  set  of  stations  is  sam¬ 
pled  on  each  reservoir.  The  result  would  be  an  RBD,  but  reservoirs 
would  represent  the  treatment-populations  and  month  would  constitute 
replicated  time  sampling. 

225.  These  were  just  a  few  examples  of  completely  randomized 
designs  and  randomized  block  designs.  As  the  basic  designs,  they  can  be 
combined  in  interesting  ways  to  produce  other  designs  having  wide  appli¬ 
cability  in  water  quality  and  field  research.  Of  particular  importance 
is  a  class  of  designs  that  are  called  "split-plot  designs"  (SPD).  Under 
certain  conditions  these  designs  are  called  "repeated  measures  designs." 

226. Consider a combination of depth and time sampling. Suppose
an established set of monitoring stations is sampled each month for
12 months. The result would be an RBD with stations as blocks and time
as treatment-populations. If, in addition, depth sampling were to be
executed, then each station at each time would in effect be split as a
function of depth in the water column. In concept, water quality
measurement is repeated for each station-month combination, but the
repeat measure is equivalent to depth sampling.

227.  Other  examples  are  direct.  Consider  a  random  set  of  sta¬ 
tions  from  each  of  two  reservoirs  at  a  particular  time  of  the  year.  If 
this  were  the  case,  the  design  would  be  a  CRD  by  virtue  of  the  fact  that 
a  random  set  of  units  (stations)  is  selected  from  each  treatment- 
population  (reservoir).  Add  depth  sampling  at  each  station  and  the 
results  would  be  an  SPD — each  station  is  split  according  to  depth  in  the 
water  column.  Notice  that  the  only  difference  between  this  example  and 
the  previous  example  is  the  base  design.  Here  the  basic  design  is  a 
CRD,  while  the  base  design  of  the  previous  example  was  an  RBD. 

228.  In  summary,  many  designs  are  available  on  which  to  base  sam¬ 
pling  or  experimental  studies.  The  key  issue  is  to  select  the  design 
which  optimizes  the  quality  of  information  for  meeting  the  objectives  of 
the research. Although only three designs have been discussed (CRD,
RBD, SPD) and only a few examples have been given, these three designs
and  related  examples  characterize  a  large  body  of  the  studies  performed 
in  water  quality  research. 

229.  In  order  to  appreciate  fully  how  designs  vary  in  maximizing 
efficiency  in  answering  research  questions,  an  understanding  of  methods 
of  data  analysis  is  required.  As  has  been  pointed  out,  the  questions  of 
the  researcher  dictate  the  design  which  in  turn  defines  the  model  with 
its  component  sources  of  variability.  The  ANOVA  quantifies  these  com¬ 
ponent  sources  so  as  to  provide  objective  criteria  on  which  to  base 
one's  inferences  and  conclusions. 

One-way  analysis  of  variance 

230.  To  develop  and  explain  the  fundamentals  of  models  and  analy¬ 
ses  of  variance,  attention  will  be  restricted  to  the  example  of  the 
three  reservoirs  and  a  random  sample  of  stations  from  each  reservoir 
(Table  17).  This  example  provides  a  minimum  level  of  complexity,  but 
within  this  level  of  complexity,  all  basic  concepts  and  principles  can 
be demonstrated and explained. Formally, this example represents a
CRD with experimental error and leads to a one-way ANOVA which tests the
hypothesis that no differences exist between treatment-populations. The
precise  meaning  of  this  terminology  will  become  clear  as  the  discussion 
proceeds. 

The  model 

231.  Based  on  the  design  followed  in  gathering  the  data,  only  two 
sources  of  variability  are  explicitly  allowed,  the  variability  due  to 
differences  between  reservoirs  and  the  variability  due  to  differences 
among  stations  within  each  reservoir.  The  function  of  the  model  thereby 
becomes  one  of  expressing  total  phosphorus  (TP)  concentrations  (the 
dependent  variable)  as  a  function  of  treatment-population  parameters 
(the independent variables). The model (typical of all CRDs with
experimental error and one-way ANOVA) is given by

    X_{ij} = \mu + \alpha_i + \varepsilon_{ij}                    (1)

where

    X_{ij} = TP concentration for the jth sampling unit (surface water
             sample at the jth station) from the ith treatment-population
             (reservoir)
    \mu = overall mean TP concentration
    \alpha_i = effect of the ith treatment-population
             = \mu_i - \mu
    \mu_i = mean TP concentration for the ith treatment-population
    \varepsilon_{ij} = residual or random error effect
                     = X_{ij} - \mu_i

232. The model in Equation 1 expresses the dependent variable
($X_{ij}$) as an additive partition of a sequence of terms ($\mu$,
$\alpha_i$, $\varepsilon_{ij}$). Within the context of an ANOVA, all
terms have a unique meaning. The terms are linear functions of
treatment-population mean parameters, $\mu$ and $\mu_i$ (i.e., the
parameters are not in any nonlinear form such as logarithmic or
exponential). Hence, ANOVA models are called "population linear additive
models." The model includes only a single dependent variable ($X_{ij}$)
which is functionally related to the sequence of terms ($\mu$,
$\alpha_i$, $\varepsilon_{ij}$). Models with the feature of one
dependent variable and multiple independent variables form the basis of
"univariate analyses of variance." Models involving multiple dependent
variables and multiple independent variables lead to "multivariate
analyses of variance" (MANOVA) and will not be considered in this work.

233. The ANOVA is designed to partition the total variability
into its component parts. The terms in the model define the partitions.
Expressing the model in Equation 1 in a slightly different fashion
(Equation 2), the partitioning accomplished by an ANOVA and its
interpretation are fairly direct and easy to comprehend.

    X_{ij} - \mu = \alpha_i + \varepsilon_{ij}
                 = (\mu_i - \mu) + (X_{ij} - \mu_i)               (2)

According to the models in Equations 1 and 2, the total deviation
($X_{ij} - \mu$) is decomposed into two additive partitions, namely
$\alpha_i = \mu_i - \mu$ and $\varepsilon_{ij} = X_{ij} - \mu_i$. These
two partitions give rise to the decomposition of the total variability.
The two partitions are appropriately called (a) the variability due to
or explained by between-treatment-population effects and (b) the
variability due to the heterogeneity which exists within each
treatment-population. Alternative and more commonly used terminology is
"the treatment-population source of variance" and "the residual or error
source of variance," respectively.

234. The ANOVA is based on the component sources as identified in
the model, and the essence of an ANOVA is simple. As the treatment-
populations become more distinct, the treatment-population source of
variance accounts for a larger portion of the total variance relative to
that which is accounted for by the residual, error, or within-treatment-
population variance. In Figure 17 the treatment-populations become more
distinct as a function of increased mean ($\mu_i$) separation. In
Figure 18 the treatment-populations become more distinct as a function
of reduced error variance.

235. The treatment-population distributions in Figures 17 and 18
conceptually represent all possible TP concentrations ($X_{ij}$) for
surface water samples from the ith reservoir. Consequently, $\mu$ refers
to the overall mean TP concentration across all three reservoirs.
Similarly, the total variance across all three reservoirs represents the
failure of all observations to be the same and equal to the overall mean
($\mu$). Therefore, it is the totality of all deviations
($X_{ij} - \mu$) in the model in Equation 2 which gives rise to the
total variance.

236. In the two-sample t test, where it is assumed that the
variances of the two sampled populations are equal, the common
population variance $\sigma^2$ was estimated by the pooled variance

    s_p^2 = (SS_1 + SS_2) / (DF_1 + DF_2)

where

    SS_i = sum of squares, sample i
    DF_i = degrees of freedom, sample i

The ANOVA also assumes the equality of variance

    \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_k^2

and the estimated population variance common to all k groups is given
by

    s_p^2 = SS_{error} / DF_{error}

where

    SS_{error} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2

    DF_{error} = N - k

The quantities $SS_{error}$ and $DF_{error}$ are often referred to as the
error sum of squares and the error degrees of freedom, respectively.
The pooled variance $s_p^2$ is the best estimate of the variance common
to all k groups.

237. To test the null hypothesis (no differences exist between
the k groups), it is necessary to determine the amount of variability
between the k groups. This variability is given by the group sum of
squares

    SS_{groups} = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2

and is associated with

    DF_{groups} = k - 1

degrees of freedom.

238. It is also necessary to consider the total variability in
the data collected, which is given by

    SS_{total} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2

with

    DF_{total} = N - 1

degrees of freedom.

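239. The sums of squares given above can be more easily calculated
as

    SS_{total} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}^2
                 - (\sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij})^2 / N

    SS_{groups} = \sum_{i=1}^{k} (\sum_{j=1}^{n_i} X_{ij})^2 / n_i
                  - (\sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij})^2 / N

The error sum of squares is calculated by difference

    SS_{error} = SS_{total} - SS_{groups}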
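The reason the error sum of squares can be obtained by difference is that the total sum of squares splits exactly into the group and error sums of squares. A short derivation, written in the sample notation used above, is sketched below.

```latex
% Sample analogue of Equation 2: split each deviation from the grand mean
X_{ij} - \bar{X} = (\bar{X}_i - \bar{X}) + (X_{ij} - \bar{X}_i)

% Square both sides and sum over all observations
\sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2
  = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2
  + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2
  + 2 \sum_{i=1}^{k} (\bar{X}_i - \bar{X}) \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)

% Deviations about a group mean sum to zero, so the cross-product term vanishes
\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i) = 0
\quad\Longrightarrow\quad
SS_{total} = SS_{groups} + SS_{error}
```

The degrees of freedom partition in the same way: (N - 1) = (k - 1) + (N - k).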

240. To complete the calculations required for the one-way ANOVA
it is necessary to divide the groups and error sums of squares by their
respective degrees of freedom:

    MS_{groups} = SS_{groups} / DF_{groups}

    MS_{error} = SS_{error} / DF_{error}

Dividing a sum of squares by the degrees of freedom results in a
variance which is called a mean square (MS) in the ANOVA. Mean square is
short for "mean squared deviations from the mean." Table 18 presents a
summary of the calculations required for a single-factor ANOVA.

241. Hypothesis testing with an ANOVA is based on the ratio of
the two sources of variance

    F = MS_{groups} / MS_{error}

The interpretation is direct. The larger the F value, the more distinct
the treatment-population distributions become and a greater amount of
the total variability is accounted for or explained by
treatment-population (group) differences. If the calculated F value is
at least as large as the critical F value (with numerator
DF = $DF_{groups}$ and denominator DF = $DF_{error}$), then the
conclusion that the treatment-populations are not equal can be made. If
the calculated F is less than the critical value, the conclusion is not
warranted.
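The computational steps of paragraphs 239 through 241 can be sketched in Python as follows. The data are the Table 17 values as transcribed here, with the single station 5 value (34) assigned to Reservoir 2; that assignment is an assumption, since the flattened table does not label the column. With this assignment the grand total (339), the sum of squared observations (9,769), and the total sum of squares (192.25) agree with Table 19; the remaining quantities depend on the group totals and may differ slightly from the tabulated figures. The function name `one_way_anova` is hypothetical.

```python
def one_way_anova(groups):
    """One-way ANOVA using the machine formulas.

    groups is a list of lists, one list of observations per group.
    Returns a dict of sums of squares, degrees of freedom, mean squares, and F.
    """
    k = len(groups)
    values = [v for g in groups for v in g]
    n_total = len(values)
    a = sum(values)                              # grand total
    b = sum(v * v for v in values)               # sum of squared observations
    c = sum(sum(g) ** 2 / len(g) for g in groups)  # sum of (group total)^2 / n_i
    correction = a ** 2 / n_total
    ss_total = b - correction
    ss_groups = c - correction
    ss_error = ss_total - ss_groups              # by difference
    df_groups, df_error = k - 1, n_total - k
    ms_groups = ss_groups / df_groups
    ms_error = ss_error / df_error
    return {
        "A": a, "B": b, "C": c,
        "ss_total": ss_total, "ss_groups": ss_groups, "ss_error": ss_error,
        "df_groups": df_groups, "df_error": df_error,
        "ms_groups": ms_groups, "ms_error": ms_error,
        "F": ms_groups / ms_error,
    }


# Table 17 values as transcribed (station 5 value placed in Reservoir 2
# by assumption).
reservoirs = [
    [31, 28, 27, 30],        # Reservoir 1
    [32, 30, 30, 31, 34],    # Reservoir 2
    [22, 21, 23],            # Reservoir 3
]
anova = one_way_anova(reservoirs)
print(f"SS_total = {anova['ss_total']:.2f}, "
      f"DF = {anova['df_groups']}, {anova['df_error']}, F = {anova['F']:.2f}")
```

The calculated F is then compared with the critical value of F for the numerator and denominator degrees of freedom returned by the function.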

242. Tables 19 and 20 present the computational steps involved in
the one-way ANOVA for the TP data from the three reservoirs. Based on
the observed F value (31.24), the null hypothesis of no differences
can be rejected. However, the exact meaning of an F test in an ANOVA is
inherently tied to what the mean squares ($MS_{groups}$, $MS_{error}$)
estimate.

Residual  (error)  effects 
and  the  mean-square  error 

243. All analyses of variance reduce to situations involving a random sample of units from a series of treatment-populations. The dependent variable represents the property of the units which is purportedly influenced by the treatment-population to which the unit belongs. As such, the dependent variable is expressed as a function of the treatment-population parameters. But in drawing inferences about these population parameters based on sample data and the results of an ANOVA, certain assumptions are made about the dependent variable. The assumptions that typify virtually all analyses of variance are given in Equation 3, which states that the $X_{ij}$ values are distributed normally and independently about a mean $\mu_i$ with variance $\sigma_i^2$:

    $X_{ij} \sim NID(\mu_i, \sigma_i^2)$    (3)

244. These assumptions say that each observation comes from a normal population with a particular unknown mean and variance. Naturally, the parameters are unknown, but sample estimates are available. However, when inferences are based on the sample data (Table 17) and the ANOVA (Table 20), an additional assumption is made. The within-treatment-population variances are assumed to be equal to a common value, denoted $\sigma_i^2 = \sigma^2$ for all $i$. The assumptions of Equation 3 thereby become those of Equation 4.


Table 19

Calculations for a One-Way ANOVA Using the Data from Table 17

Quantity                                                          Value
$A = \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}$                      339.0
$B = \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}^2$                    9,769.0
$C = \sum_{i=1}^{k} (\sum_{j=1}^{n_i} X_{ij})^2 / n_i$            9,744.8

$SS_{total} = B - (A^2/N)$                                        192.25
$SS_{groups} = C - (A^2/N)$                                       168.05
$SS_{error} = SS_{total} - SS_{groups}$                           24.20

$DF_{total} = N - 1$                                              11
$DF_{groups} = k - 1$                                             2
$DF_{error} = N - k$                                              9

$MS_{groups} = SS_{groups} / DF_{groups}$                         84.03
$MS_{error} = SS_{error} / DF_{error}$                            2.69

Table 20

Analysis of Variance on the Data of Table 19

Source of Variation           SS        DF     MS
Total                         192.25    11
Reservoirs (i.e., groups)     168.05     2     84.03
Error                          24.20     9      2.69

$F = MS_{groups} / MS_{error} = 84.03 / 2.69 = 31.24$

$\alpha = 0.05$:  $F_{2,9} = 4.26$

    $X_{ij} \sim NID(\mu_i, \sigma^2)$    (4)

245. The consequences of this last constraint are immediate. Since each treatment-population has the same variance, each sample variance (see Table 18) estimates the common value. In turn, since all sample variances estimate the common variance, the sample variances can be pooled to provide a combined sample estimate of the within-treatment-population variance. It is precisely this combined or pooled sample estimate of the within-treatment-population variance which is called the mean-square error ($MS_e$) in the ANOVA. The idea of pooling is simple and can be shown using the data of Table 19. After pooling (i.e., adding) the individual sums of squares (i.e., $SS_e = 0.1875 + 0.272 + 0.14 = 0.5995$), divide the pooled sum of squares by the pooled degrees of freedom (i.e., $df_e = 3 + 4 + 2 = 9$) to generate the mean-square error ($MS_e = 0.5995/9 = 0.0666$).
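The pooling arithmetic described above can be written as a short Python sketch. The function name is hypothetical; the within-group sums of squares and degrees of freedom are the values quoted in the text.

```python
def pooled_mean_square(sums_of_squares, degrees_of_freedom):
    """Pool within-group sums of squares and degrees of freedom (hypothetical helper)."""
    pooled_ss = sum(sums_of_squares)
    pooled_df = sum(degrees_of_freedom)
    return pooled_ss, pooled_df, pooled_ss / pooled_df


# Within-group sums of squares and degrees of freedom quoted in the text.
pooled_ss, pooled_df, ms_error = pooled_mean_square(
    [0.1875, 0.272, 0.14], [3, 4, 2]
)
```

Dividing the pooled sum of squares by the pooled degrees of freedom weights each sample variance by its own degrees of freedom, so larger samples contribute more to the combined estimate.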

246. Since the assumption of equality of variances (Equation 4) refers to within-treatment-population variability and since the deviations of the models in Equations 1 and 2 yield the within-treatment-population variance, the assumptions of Equation 4 are typically stated in terms of the residual (error) component of the model as given in Equation 5, which states that residual (error) effects are distributed normally and independently about a mean 0 and common error variance. The moral is that the $MS_e$ estimates the common within-treatment-population variance ($\sigma_\varepsilon^2$). Hence, Equation 6 says that the expected value of the $MS_e$ is $\sigma_\varepsilon^2$.

    $\varepsilon_{ij} \sim NID(0, \sigma_\varepsilon^2)$    (5)

    $E(MS_e) = \sigma_\varepsilon^2$    (6)

Treatment-population 
effects:  fixed  or  random? 

247.  Often  a  researcher  faces  a  dilemma  in  deciding  which  of 
several  or  many  treatment-populations  he  should  study  intensively.  The 
researcher's  dilemma  may  be  phrased  as  follows:  "Does  the  researcher 
want  to  know  about  differences  between  the  three  reservoirs  and  only  the 
three  reservoirs  that  were  sampled?"  or  "Does  the  researcher  want  to 
draw  inferences  about  all  reservoirs  in  the  region  based  on  the  subset 
of  three  reservoirs?"  If  the  answer  to  the  first  question  is  "yes," 
the  researcher  is  dealing  with  fixed  effects.  If  the  answer  to  the 
second  question  is  "yes,"  the  researcher  is  dealing  with  random  effects. 

Expected  mean-squares  for 
fixed  effects  (Model  I  ANOVA) 

248. Fixed effects refer to a set of treatment-populations which is finite either by nature or by an "a priori" selection process; more importantly, inferences are limited to only those treatment-populations that are sampled. If the researcher samples all or some of the treatment-populations and if the researcher's concern is with the sampled treatment-populations and only those evaluated, the model in Equation 1 is a fixed effects model, and the analysis of variance is called a Model I ANOVA. Consequently, for the data of Table 17 and its analysis (Tables 19 and 20), if the researcher's interest lies solely in the three sampled reservoirs, then reservoir effects would be considered fixed.

249. Due to the nature of fixed effects, the inferential process conventionally involves either or both estimation of treatment-population means or mean differences and tests of hypotheses about significant differences among means. Fixed effects mean-squares are consistent with the nature of the inferences. The expected mean-square for the treatment-population source of variance is written in the form of Equation 7, and its meaning and interpretation are linked to the quadratic portion of the expected mean-square (Equation 8) and the hypothesis ($H_0$) of conventional interest. Thus,

    $E(MS_\alpha) = \sigma_\varepsilon^2 + Q(\alpha)$    (7)

where $Q(\alpha)$ is a function of the treatment-population effects ($\alpha$ = reservoir effects)

    $Q(\alpha)$ involves $\sum_{i=1}^{p} \alpha_i^2$    (8)

    $H_0$: all treatment-population means are equal or all treatment-population effects are zero    (9)

250. Since the F value is the ratio of the $MS_\alpha$ to the $MS_e$, the interpretation of the F test statistic is direct. If the null hypothesis ($H_0$: the hypothesis of no differences) is true, then (a) all treatment-population distributions and means would be superimposed and, as such, equal to each other and equal to the overall mean; (b) all effects would be equal to zero and, as such, $Q(\alpha)$ would be equal to zero; (c) the $E(MS_\alpha) = \sigma_\varepsilon^2 = E(MS_e)$; and (d) the researcher would expect the F value to be close to one. Conversely, if the null hypothesis was not true (i.e., all means were not equal), then (a) the treatment-population effects would show larger and larger deviations from zero as the treatment-population distributions become more distinct; (b) $Q(\alpha)$ would show a corresponding increase; and (c) the researcher would expect the F value to get larger and larger since the $E(MS_\alpha)$ would increase relative to the $E(MS_e)$.
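The behavior of the F ratio under the null and alternative hypotheses can be illustrated with a small simulation. The sketch below is not from the original report: the group means, sample size, error standard deviation, number of replications, and seed are all arbitrary assumptions chosen only to show the pattern, and the function names are hypothetical.

```python
import random


def f_ratio(groups):
    """F = MS_groups / MS_error for a list of samples (one list per group)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_groups = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_groups = ss_groups / (len(groups) - 1)
    ms_error = ss_error / (n_total - len(groups))
    return ms_groups / ms_error


def mean_f(group_means, n_per_group=4, sigma=1.0, reps=2000, seed=0):
    """Average F over repeated normal samples drawn with the given true means."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        groups = [
            [rng.gauss(mu, sigma) for _ in range(n_per_group)]
            for mu in group_means
        ]
        total += f_ratio(groups)
    return total / reps


# Null hypothesis true: all treatment-population means equal (assumed values).
mean_f_null = mean_f([5.0, 5.0, 5.0])
# Null hypothesis false: treatment-population means differ (assumed values).
mean_f_alt = mean_f([4.0, 5.0, 7.0])
```

With equal means the average F stays near one (slightly above it when the error degrees of freedom are small), whereas with unequal means the average F is many times larger, reflecting the growth of $Q(\alpha)$.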


251. If the F value reached the selected significance level, the inference would be that significant differences exist among the means. A natural question of the researcher would be, "Which means are different from each other?" The issue is that an ANOVA and its F test do not lead to the inference that all means are different from each other. A significant F test merely claims that differences do exist. Specific questions about particular means and mean differences (i.e., post-ANOVA type questions) require techniques which supplement the ANOVA. These techniques are the subject of a later section.

Expected  mean-squares  for 
random  effects  (Model  II  ANOVA) 


252. Random effects are considered by many to be conceptually more elusive than fixed effects. This should not be so. Random effects refer to a random sample of treatment-populations. Consider the dilemma of the researcher regarding which reservoirs to study intensively. If the researcher wished to infer something about the variability among all reservoirs across an entire region, then he certainly will not be able to study all reservoirs. He could, however, draw a sample of reservoirs at random, compute the variance of the reservoir effects sampled (denoted $s_\alpha^2$) and let this sample variance estimate the variance of the entire population of reservoir effects (denoted $\sigma_\alpha^2$). In essence, this is precisely what happens in random effects models or Model II ANOVA.

253. For random effects, the nature of the hypothesis changes from that for fixed effects. The hypothesis (Equation 10) is stated in terms of the variance component being estimated. Likewise, the expected mean-square (Equation 11) is a function of the variance component.

    $H_0: \sigma_\alpha^2 = 0$    (10)

    $E(MS_\alpha) = \sigma_\varepsilon^2 + k\sigma_\alpha^2$    (11)

254. The meaning of the F test is clear. If the null hypothesis (Equation 10) is true, the $E(MS_\alpha) = \sigma_\varepsilon^2 = E(MS_e)$ and the researcher would expect the F value to draw closer to one as the variance drew closer to zero. Conversely, if the variance of reservoir effects ($\sigma_\alpha^2$) were large, the F value would increase and the inference would be that a significant degree of variability exists among the reservoirs of the region. Naturally, the researcher would want to estimate this variance. As with post-ANOVA tests for fixed effects, this estimation enterprise will be pursued later.
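Equations 6 and 11 together suggest how the variance component might be estimated: subtracting the two expected mean squares leaves $k\sigma_\alpha^2$. The Python sketch below illustrates that idea only. It assumes that k denotes the common number of units sampled from each treatment-population in a balanced design (the text above does not define k), it truncates negative estimates at zero as a convention adopted here, and the numerical inputs are purely hypothetical.

```python
def variance_component(ms_groups, ms_error, k):
    """Moment-style estimate of sigma_alpha^2 (hypothetical helper).

    Solves E(MS_alpha) = sigma_e^2 + k * sigma_alpha^2 for sigma_alpha^2,
    substituting the observed mean squares for their expectations.
    """
    estimate = (ms_groups - ms_error) / k
    # A variance cannot be negative; truncation at zero is an assumption here.
    return max(estimate, 0.0)


# Hypothetical mean squares from a balanced design with k = 5 units per group.
sigma2_alpha = variance_component(ms_groups=12.0, ms_error=2.0, k=5)
```

When the groups mean square is no larger than the error mean square, the sketch returns zero, which corresponds to data that give no indication of variability among treatment-populations.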

255. Several issues need to be raised and explained before concluding the discussion of random effects. Random effects, by definition, involve a two-stage random sampling process. The first stage involves the random selection of a set of treatment-populations while the second stage involves the random sampling of units from each treatment-population. There are cases, however, where the first stage will not actually involve a random sampling process. If this occurs, a minimum requirement is that the researcher must be willing to assume that his set of treatment-populations represents a random sample.

256. The second issue is tied to the first. In the presence of random effects, special assumptions (Equation 12) are made in addition to those made about error effects (Equation 5). The assumptions of Equation 12 read, "The $\alpha$ values are normally distributed about a mean of zero with variance $\sigma_\alpha^2$." These assumptions are not problematic in concept. They simply mirror the assumptions made about residual or error effects which are, in actuality, random effects themselves.

    $\alpha_i \sim N(0, \sigma_\alpha^2)$    (12)

257.  The  third  issue  is  simply  a  point  of  clarification.  The 
models  of  an  ANOVA  need  not  be  simply  fixed  effects  or  random  effects 
models.  The  structure  or  nature  of  the  treatment-population  may  be 
partially  random.  If  so,  the  models  are  called  "mixed  models." 

Examples  of  these  types  of  models  are  discussed  in  a  later  section. 

258. The final issue is related to the nature of the inferential process in random effects models. Sometimes a completely valid F test is unavailable. This situation occurs in studies involving unbalanced data (i.e., unequal sample sizes from each treatment-population). The nature of the inferential process thereby becomes one of estimating variance components rather than specific tests of hypotheses. Variance component estimation is discussed in more detail later. As a final note, the problems caused by unbalanced data within the context of random effects do not occur in fixed effects models.

Fixed effects and arrangements
of treatment-populations

259.  Inferences  about  fixed  effects  are  normally  restricted  to 
(a)  tests  of  hypotheses  about  significant  differences  between 
treatment-population  means,  and  (b)  point  or  interval  estimation  of 
treatment-population  means  and  mean  differences.  Remember,  sampling  or 
experimental  designs  refer  to  the  plan  or  procedure  followed  to  obtain  a 
random  sample  of  units  from  a  series  of  treatment-populations.  Except 
for  the  prior  and  more  detailed  discussion  of  fixed  and  random  effects, 
little  has  been  said  about  how  the  set  of  treatment-populations  to  be 
studied  might  be  defined. 

260. Often treatment-populations differ according to more than one dimension or can be distinguished by more than one factor. Consider a study designed to evaluate surface water quality in two coves ($c_1$, $c_2$) as a function of two depths ($d_1$, $d_2$) across three seasonal months ($m_1$, $m_2$, $m_3$) and suppose further that several water samples are drawn at each depth from each cove during each month. As a result, 12 treatment-population combinations exist, but the treatment-populations are distinguished by three factors or dimensions: cove at two levels, depth at two levels, and month at three levels. The cross-classification of these factors yields the 12 treatment-population combinations. The factors are said to be cross-classified because every level of each factor occurs with every level of the other factors; that is, each cove is sampled during the same months at the same depths. This property of cross-classification yields a set of treatment-populations which are said to follow a "factorial arrangement of factors."
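The cross-classification just described amounts to forming every combination of factor levels. A minimal Python sketch, using hypothetical level labels for the two coves, two depths, and three months, is:

```python
from itertools import product

# Hypothetical labels for the factor levels described in the text.
coves = ["c1", "c2"]
depths = ["d1", "d2"]
months = ["m1", "m2", "m3"]

# A factorial arrangement contains every cove-depth-month combination.
treatment_populations = list(product(coves, depths, months))
n_combinations = len(treatment_populations)  # 2 * 2 * 3
```

Each tuple in `treatment_populations` identifies one treatment-population, and the count equals the product of the numbers of levels.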

261. Contrast this factorial structure to the following example. Suppose the study involved two different reservoirs ($r_1$, $r_2$), but two coves from one reservoir ($c_1$, $c_2$ from $r_1$) and three coves from the second reservoir ($c_1$, $c_2$, $c_3$ from $r_2$). Suppose further that the sampling of coves was done during different months. For instance, let the 12 months be denoted by $m_1, m_2, \ldots, m_{12}$ with the following sampling scheme
r^  during  n^,  m1Q  ;  r^  during  m^,  m2,  mg  ;  r ^  during 

my  m^  ;  ric2  durin8  “5*  mn»  m12  *  r2C3  durin8  m2»  m7‘  m8  *  Note 
immediately that 12 treatment-populations exist based on three factors (reservoir, cove, month) which are not cross-classified: there are different coves within each reservoir and different sampling dates for each cove. This property of different levels of a factor at different levels of other factors yields what is termed a "nested" or "hierarchical" structure.

Factorial  arrangements 
(main  and  interaction  effects) 

262.  The  function  of  an  ANOVA  in  fixed  effects  models  is  to 
determine  whether  significant  differences  exist  among  treatment- 
population  means.  Table  21  and  Figure  19  present  the  mean  water  quality 
(WQ)  for  each  of  the  12  treatment-populations  which  are  based  on  a  2  by 
2  by  3  factorial  arrangement  of  three  factors  (cove,  depth,  month).  The 
raw  data  of  Table  21  are  totally  contrived  to  expedite  the  subsequent 
discussion.  Results  of  the  ANOVA  are  given  in  Table  22. 


    $\overline{CDM}_{ijk}$ = mean water quality for the $i$th cove ($i = 1, \ldots, c = 2$)
                             at the $j$th depth ($j = 1, \ldots, d = 2$)
                             during the $k$th month ($k = 1, \ldots, m = 3$)    (13)


263.  As  can  be  seen  easily  in  Figure  19  and  Table  21,  the  ANOVA 
would  lead  to  the  conclusion  that  differences  between  means  exist,  and 
water  quality  varies  as  a  function  of  cove,  depth,  and  month.  The 
question  arises,  "How  might  treatment-population  effects  be  explained?" 
According  to  the  factorial  structure  of  the  treatment-populations,  two 
types  of  effects  can  be  identified,  "main"  and  "interaction"  effects. 
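The ANOVA summarized in Table 22 can be reproduced by treating the 12 cove-depth-month combinations as the groups of a one-way analysis. The Python sketch below is offered as a check on the arithmetic, not as the procedure actually used to produce the table; the observations are transcribed from Table 21, keyed by (cove, depth, month), and the variable names are hypothetical.

```python
# Observations from Table 21, keyed by (cove, depth, month).
cells = {
    (1, 1, 1): [1.8, 2.2],
    (1, 2, 1): [2.0],
    (1, 1, 2): [3.7, 4.3, 4.0],
    (1, 2, 2): [8.4, 7.5, 7.6, 8.5],
    (1, 1, 3): [5.5, 6.5],
    (1, 2, 3): [7.8, 8.2],
    (2, 1, 1): [1.0],
    (2, 2, 1): [0.9, 1.1, 1.0],
    (2, 1, 2): [3.6, 4.4],
    (2, 2, 2): [8.1, 7.9],
    (2, 1, 3): [0.8, 1.2, 1.0],
    (2, 2, 3): [3.0],
}

observations = [x for values in cells.values() for x in values]
n_total = len(observations)
grand_mean = sum(observations) / n_total

# Total variability about the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in observations)
# Error variability: deviations of observations from their own cell mean.
ss_error = sum(
    (x - sum(values) / len(values)) ** 2
    for values in cells.values()
    for x in values
)
# Variability among the 12 treatment-population (CDM) means, by difference.
ss_cells = ss_total - ss_error

df_cells = len(cells) - 1
df_error = n_total - len(cells)
ms_cells = ss_cells / df_cells
ms_error = ss_error / df_error
f_value = ms_cells / ms_error
```

The resulting sums of squares, degrees of freedom, and F value agree with the CDM, Error, and Total rows of Table 22 to the precision printed there.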

264.  The  distinction  between  main  and  interaction  effects  is 
quite  simple.  Questions  about  main  effects  can  be  stated,  "Are  there 
significant  differences  among  the  means  at  each  level  of  a  single 
factor?"  For  instance,  questions  about  whether  coves  differ  in  mean 
Table 21

Hypothetical Water Quality for a 2 by 2 by 3 Factorial Arrangement of Three Factors: Cove, Depth, Month

Cove (i)   Depth (j)   Month (k)   Sample (l)   WQ (ijkl)   Mean ($\overline{CDM}_{ijk}$)
   1          1           1           1           1.8          2.0
   1          1           1           2           2.2
   1          2           1           1           2.0          2.0
   1          1           2           1           3.7          4.0
   1          1           2           2           4.3
   1          1           2           3           4.0
   1          2           2           1           8.4          8.0
   1          2           2           2           7.5
   1          2           2           3           7.6
   1          2           2           4           8.5
   1          1           3           1           5.5          6.0
   1          1           3           2           6.5
   1          2           3           1           7.8          8.0
   1          2           3           2           8.2
   2          1           1           1           1.0          1.0
   2          2           1           1           0.9          1.0
   2          2           1           2           1.1
   2          2           1           3           1.0
   2          1           2           1           3.6          4.0
   2          1           2           2           4.4
   2          2           2           1           8.1          8.0
   2          2           2           2           7.9
   2          1           3           1           0.8          1.0
   2          1           3           2           1.2
   2          1           3           3           1.0
   2          2           3           1           3.0          3.0

water quality and whether mean water quality changed as a function of depth are main effect questions, and three main effect questions exist (one for each factor).

265. Questions about interaction effects refer to interrelationships among the levels of the different factors and can be stated: (a) "Is the effect of a factor the same across levels of other factors?" or (b) "Are the mean differences among levels of a factor consistent across levels of other factors?" If the answer is "yes," the factors do

Table 22

Results of ANOVA for the Data of Table 21

Source    df    SS         MS        F         Pr > F
CDM       11    209.538    19.049    126.99    0.0001*
Error     14      2.100     0.150
Total     25    211.638

* Significant at the 0.0001 level.

not interact and the individual factor effects are said to be "additive." If the answer is "no," an interaction effect is said to exist.

266. In our example, four interaction effects are possible: interaction effects for each pair of factors (i.e., the three two-factor interactions of C with D, C with M, and D with M) and an interaction effect for all three factors (i.e., one three-factor interaction). For instance, the two-factor interaction (C by D) would concern whether the change in water quality as a function of depth was the same for both coves. The three-factor interaction (C by D by M) would refer to whether or not the interrelationship for C and D was the same across months.

267. Generalization to higher order factorial structures is fairly direct. Consider a four-way cross-classification system (arbitrarily, let the four factors be A, B, C, and D). If this were the situation, four main effects (A, B, C, D), six two-factor interactions (AB, AC, AD, BC, BD, CD), four three-factor interactions (ABC, ABD, ACD, BCD), and one four-factor interaction (ABCD) are possible.
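The enumeration of effects for a four-way cross-classification follows directly from taking the factors one, two, three, and four at a time. A brief Python sketch (the function name is hypothetical) makes the counting explicit:

```python
from itertools import combinations


def enumerate_effects(factors):
    """List main effects and interactions by order (hypothetical helper).

    Order 1 gives the main effects, order 2 the two-factor interactions,
    and so on up to the single interaction involving every factor.
    """
    return {
        order: ["".join(combo) for combo in combinations(factors, order)]
        for order in range(1, len(factors) + 1)
    }


effects = enumerate_effects(["A", "B", "C", "D"])
counts = [len(effects[order]) for order in sorted(effects)]
```

The same function applied to the three factors of the cove, depth, and month example yields three main effects, three two-factor interactions, and one three-factor interaction.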

268. To understand the nature of main effects and interaction effects, the means of Table 21 ($\overline{CDM}_{ijk}$) and Figure 19 require additional summarization. Table 23 and Figure 20 present the means, as defined in Equations 14-16, to explain main effects, while Table 24 and Figure 21 present the means, as defined in Equations 17-19, to explain the two-factor interactions. Table 25 and Figure 22 present the means of Table 21 and Figure 19, but the presentation is in a form that is better suited for
Table 23

Summarization of Means and Mean Differences for Each Main Effect


Factor 

Means 

Mean  Differences 

Cove  (1) 

C1 

5.6 

2.8 

Cx  -  C2  -  2.8 

C2 

5.5 


Table 24

Summarization of Means and Mean Differences for Each Two-Factor Interaction

Cove by Month

                          Cove Differences        Difference by Month of Cove Differences
Month    C_1     C_2      (C_1 - C_2) by M        (M_1 - M_3, M_2 - M_3)
M_1      2.0     1.0           1.0                     -4.5
M_2      6.3     6.0           0.3                     -5.2
M_3      7.0     1.5           5.5

Cove by Depth

                          Cove Differences        Difference by Depth of Cove Differences
Depth    C_1     C_2      (C_1 - C_2) by D        (D_1 - D_2)
D_1      4.0     2.0           2.0                     -1.4
D_2      7.1     3.7           3.4

Depth by Month

                          Depth Differences       Difference by Month of Depth Differences
Month    D_1     D_2      (D_1 - D_2) by M        (M_1 - M_3, M_2 - M_3)
M_1      1.7     1.25          0.5                      3.8
M_2      4.0     8.0          -4.0                     -0.7
M_3      3.0     6.3          -3.3

Figure 22. Three-factor interaction for the 12 treatment-populations

$n_{jk}$ = number of samples from the $j$th depth during the $k$th month

269. The essential feature of any effect (main, interaction, nested, etc.) is that each is defined in terms of differences among means. Hence, every effect has a set of one or more defining contrasts. The final objective of the present discussion is to develop the contrast matrix of Table 25 and to establish a set of rules for specifying the defining contrasts for any effect. An adequate appreciation of a contrast matrix will not only allow for the presentation of many more examples than would otherwise be possible but will also significantly expedite the discussion of future topics.

270. Only one contrast is needed to establish the presence or absence of a cove main effect. If water quality does not depend on the cove from which the water was drawn, the cove means ($C_i$) in Table 23 would be equal and the difference between cove means would be zero (i.e., $C_1 - C_2 = 0$). As is obvious, such is not the case (i.e., $C_1 - C_2 = 2.0$). Water quality at $C_1$ was different than at $C_2$. Only one contrast among means is required because the cove factor has only two levels. A general rule (Rule 1) for main effects is that the required number of defining contrasts is one less than the total number of levels. By identical arguments, a main effect for depth exists. The magnitude of the water quality variable increased with depth. Naturally, more depths could have been sampled to provide a more thorough picture of change in water quality.

271. A month main effect exists along with the cove and depth main effects. Water quality changed with time. Since the factor of month had three levels, only two defining contrasts are required even though three contrasts are possible (i.e., $M_1 - M_2$, $M_1 - M_3$, $M_2 - M_3$). Table 23 gives the contrasts $M_1 - M_3$ and $M_2 - M_3$.

272. The means and mean differences show that, with time, water quality rises and then falls, but not to the level observed for the first month. A second general rule (Rule 2) for main effects is that the set of all possible defining contrasts consists of all possible differences among means. A convenient selection consists of those differences involving the mean for the last level of the factor. As a precautionary note, other contrasts are possible (e.g., $M_1 - 2M_2 + M_3$ is a contrast), but Rule 2 provides a convenient and easily generated set.
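Rules 1 and 2 can be sketched in Python as follows. The helper names are hypothetical, the three means are illustrative values and not the entries of Table 23, and the check in `is_contrast` relies on the standard definition that the coefficients of a contrast sum to zero.

```python
def is_contrast(coefficients, tol=1e-12):
    """A linear combination of means is a contrast when its coefficients sum to zero."""
    return abs(sum(coefficients)) < tol


def defining_contrasts(n_levels):
    """Rule 1 and Rule 2: (n_levels - 1) contrasts, each level minus the last level."""
    rows = []
    for level in range(n_levels - 1):
        row = [0] * n_levels
        row[level] = 1
        row[-1] = -1
        rows.append(row)
    return rows


def apply_contrast(coefficients, means):
    """Evaluate a contrast for a given set of level means."""
    return sum(c * m for c, m in zip(coefficients, means))


# A three-level factor (e.g., month) needs two defining contrasts.
month_contrasts = defining_contrasts(3)   # M1 - M3 and M2 - M3

# Illustrative (hypothetical) level means that rise and then fall.
hypothetical_means = [2.0, 6.0, 4.0]
contrast_values = [apply_contrast(row, hypothetical_means) for row in month_contrasts]

# Another valid contrast among three means, with coefficients (1, -2, 1).
other_is_contrast = is_contrast([1, -2, 1])
```

If every value in `contrast_values` were zero, the level means would all be equal and no main effect would be indicated for that factor.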

273. Having chosen the two defining contrasts, a third general rule (Rule 3) for main effects (in fact, for any effect) is that all defining contrasts must be zero in order to establish the absence of a main effect. Rule 3 can be shown easily with some simple algebra. Merely let one contrast be different from zero (e.g., suppose $M_1 - M_3 = 0$ but $M_2 - M_3 = 2$). By the first contrast $M_1 = M_3$, and by the second contrast $M_2 = M_3 + 2$. Therefore, $M_2$ is larger than $M_3$ and, as such, not all means are equal (i.e., a main effect exists). Rule 3 also illustrates an extremely important principle: the presence of an effect does not imply that all means are different. If an ANOVA indicates the presence of an effect, then the inference is simply that differences exist somewhere. Precisely where these differences occur requires supplementing the ANOVA with other techniques. Many specific techniques are available for ascertaining precisely which means are different from each other. These techniques are discussed later, but all are based on the concept of contrasts among means.

274. Two-factor interactions are more complex than main effects, but certain principles are the same. Attention is restricted initially to the C by D interaction (Table 24, Figure 21). A CD interaction exists; the difference between $C_1$ and $C_2$ is not the same at both depths.

275. To define a two-factor interaction, simply perform a double-contrast operation (Rule 4): (a) choose a factor and evaluate its defining contrast(s) at each level of the second factor, then (b) apply the defining contrast(s) of the second factor to the results of the contrasts from (a). Referring to the C by D interaction in Table 24, the defining contrast for cove differences (i.e., $C_1 - C_2$) is applied separately at each depth. The defining contrast for depth differences (i.e., $D_1 - D_2$) is then applied to the cove differences. The result is not zero and, by Rule 3, an interaction exists. The double-contrast operation yields Rule 5, which specifies the number of defining contrasts required to establish any interaction effect: the number of defining contrasts equals the product of the number of defining contrasts for each factor involved in the interaction (e.g., since both the C main effect and the D main effect had only one defining contrast, only one defining contrast is required for the C by D interaction).
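The double-contrast operation of Rule 4 can be traced numerically with the means reported in Table 24. The following Python sketch uses hypothetical function names and rounds each result to one decimal place to match the precision of the table.

```python
# Means from Table 24: each entry is (mean for cove C1, mean for cove C2).
cove_by_depth = {"D1": (4.0, 2.0), "D2": (7.1, 3.7)}
cove_by_month = {"M1": (2.0, 1.0), "M2": (6.3, 6.0), "M3": (7.0, 1.5)}


def cove_differences(table):
    """Step (a): apply the cove contrast C1 - C2 at each level of the second factor."""
    return {level: c1 - c2 for level, (c1, c2) in table.items()}


def versus_last(differences):
    """Step (b): contrast each level's difference with that of the last level."""
    levels = list(differences)
    last = differences[levels[-1]]
    return [round(differences[level] - last, 1) for level in levels[:-1]]


# C by D: one defining contrast, (1 cove contrast) x (1 depth contrast).
c_by_d = versus_last(cove_differences(cove_by_depth))
# C by M: two defining contrasts, (1 cove contrast) x (2 month contrasts).
c_by_m = versus_last(cove_differences(cove_by_month))
```

The lengths of `c_by_d` and `c_by_m` illustrate Rule 5, and their values correspond to the last column of the Cove by Depth and Cove by Month panels of Table 24.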

276. Identical arguments are used in Table 24 to establish the presence of a C by M and D by M interaction. Figure 21 shows that the change in water quality as a function of time is different depending upon cove (C by M) and depending upon depth of sampling (D by M). Restricting attention to the C by M interaction (same arguments apply to the M by D interaction), the defining contrast for coves ($C_1 - C_2$) is applied for each month and then, by Rule 4, the defining contrasts for month ($M_1 - M_3$, $M_2 - M_3$) are applied to the cove contrasts. By Rule 5, the C by M interaction requires two defining contrasts and, by Rule 3, all must be zero to claim no interaction. Obviously, such is not the case for either C by M or M by D interaction.

277. Evaluation of the three-factor interaction (C by D by M) simply requires generalizing Rule 4 to produce Rule 6. To evaluate any higher order interaction: (a) establish the defining contrast(s) for any interaction which is one degree lower than the interaction of concern and (b) apply the defining contrasts of the ignored factor to the defining contrasts of the lower order interaction. Rule 6 is followed in Table 25. The defining contrasts for the C by M interaction are established at each depth and then the defining contrast for depths is applied. By Rule 5, the three-factor interaction requires two defining contrasts, and by Rule 3, no three-factor interaction exists because all defining contrasts equal zero.
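Rule 6 can be checked directly against the 12 treatment-population means of Table 21. In the Python sketch below the cell means are transcribed from that table, keyed by (cove, depth, month); the function name is hypothetical, and the layout of Table 25 itself is not reproduced.

```python
# Cell means from Table 21, keyed by (cove, depth, month).
cell_means = {
    (1, 1, 1): 2.0, (1, 1, 2): 4.0, (1, 1, 3): 6.0,
    (1, 2, 1): 2.0, (1, 2, 2): 8.0, (1, 2, 3): 8.0,
    (2, 1, 1): 1.0, (2, 1, 2): 4.0, (2, 1, 3): 1.0,
    (2, 2, 1): 1.0, (2, 2, 2): 8.0, (2, 2, 3): 3.0,
}


def cove_by_month_contrasts(depth):
    """Step (a): the two C by M defining contrasts evaluated at one depth."""
    cove_diff = {
        month: cell_means[(1, depth, month)] - cell_means[(2, depth, month)]
        for month in (1, 2, 3)
    }
    # Month contrasts M1 - M3 and M2 - M3 applied to the cove differences.
    return [cove_diff[1] - cove_diff[3], cove_diff[2] - cove_diff[3]]


at_depth_1 = cove_by_month_contrasts(1)
at_depth_2 = cove_by_month_contrasts(2)

# Step (b): apply the depth contrast D1 - D2 to the C by M contrasts.
three_factor_contrasts = [a - b for a, b in zip(at_depth_1, at_depth_2)]
```

Because the C by M contrasts take the same values at both depths, both entries of `three_factor_contrasts` are zero, which is the condition Rule 3 requires for the absence of a three-factor interaction.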

278. The results of all main effect and interaction effect comparisons may be summarized. By Figure 19, differences existed between means. Not all effects (main and interaction) contributed to the differences witnessed in Figure 19; the differences were not due to the C by M nor the C by D by M interaction, even though a quick glance at all ongoing fluctuations across coves, months, and depths would seem to indicate otherwise (Figure 19 or 22). Consequently, visual inspection is insufficient for establishing higher order interaction effects. Finally, main effects and interaction effects constitute only one set of contrasts among means. Many contrasts are possible; the researcher's task is to select the set of contrasts which has the most meaning for the research questions at hand.

Hierarchical  arrangements 

279. "Nested" or "hierarchical" arrangements are not that far removed from a factorial arrangement. Structurally, the difference is that there are different levels of one factor at each level of another factor. For instance, a study may involve two reservoirs and three recreational areas on each reservoir, thereby yielding six treatment-populations. The nature of the inferences about the treatment-population effects is somewhat different from those under a factorial structure. To describe the inferential process, the data of
Table 26 can be used. In Case I there is a reservoir main effect (the reservoir marginal means are not equal), but a reservoir main effect does not exist in Case II.

280. A word of caution is required regarding main effects in a
hierarchical structure. The marginal means are based on different
levels of another factor. Hence, main effects comparisons are not made
across common or comparable levels of other factors. This word of
caution also points to the character of nested effect questions. Rather
than asking about comparability of factor levels as a function of other
factors, as is done with factorial structures, hierarchical structure
inquiries take the form, "Are there differences among the particular
levels of a factor within each level of other factors?" Based on the
data of Table 26 this question must be answered in the affirmative in
both cases.


281. It probably goes without saying that as the number of factors and
number of levels of each factor increase, the interpretation of higher
order interaction effects and nested effects becomes a more difficult
enterprise. In actuality, this is true only for higher order
interaction effects. The interpretation of higher order nested effects
is considerably easier. The reason lies in the definition of the
effects. In the case of higher order interactions, the effects are
stated in terms of differences of differences of differences. For
instance, if the situation involved a four-way interaction of A, B, C,
and D, then the four-way interaction would ask if a three-way
interaction (such as A by B by C) was constant across the fourth factor
(D). Evaluating the three-way interaction across the fourth factor
amounts to asking, "Are the differences (among C) of the differences
(among B) of the differences (among A) constant over D?" For nested
effects, by contrast, the question is simply whether differences exist
among levels of the factor which is nested.

282. Suppose a three-factor hierarchical arrangement of A, B within A
(B/A), and C within B within A (C/B/A) is employed. Letting $\mu_{ijk}$
denote the $ABC_{ijk}$ means, the effects may be defined as differences
between the means at one level of the hierarchy and the mean of the
group in which they are nested. Notice that these effects are 0 if and
only if the means within a group are equal. For instance, the B/A
effects within the $i$th level of A are 0 if and only if all the means
($\mu_{ij\cdot}$) within the $i$th level of A are equal to the overall
$i$th A mean ($\mu_{i\cdot\cdot}$). Thus, nested effects exist if
differences among means within a group simply exist.
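
Under the conventional dot notation for means averaged over a subscript, the nested effects described in paragraph 282 are commonly written as follows. The symbols $\alpha$, $\beta$, and $\gamma$ are assumed here for illustration and may differ from the notation used elsewhere in this report.

```latex
\begin{aligned}
\alpha_i        &= \mu_{i\cdot\cdot} - \mu_{\cdot\cdot\cdot} && \text{(A effect)} \\
\beta_{j(i)}    &= \mu_{ij\cdot} - \mu_{i\cdot\cdot}         && \text{(B within A effect)} \\
\gamma_{k(ij)}  &= \mu_{ijk} - \mu_{ij\cdot}                 && \text{(C within B within A effect)}
\end{aligned}
```

Written this way, each nested effect vanishes exactly when every mean in a group equals the group mean, which is the condition stated in paragraph 282.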


Post-ANOVA  Hypotheses 


283. Researchers often need to make specific contrasts or test
specific hypotheses about differences among means, hypotheses which are
not answered directly by main effects, interaction effects, nested
effects, or combinations thereof. For instance, the analysis of Table 19
tests the omnibus hypothesis, "Are there differences among coves?"
However, the researcher may wish to know whether pairwise differences
exist, that is, which coves are different.


284. Define a contrast by

$$L = \sum_{i=1}^{a} c_i \mu_i$$

where

$\mu_i$ = $i$th treatment-population mean
$c_i$ = contrast multiplier for the $i$th mean

285.  Suppose  the  researcher  wants  to  test  whether  cove  1  differs 
significantly  from  cove  2. 

286. Then the hypothesis would be

$$H_0: \mu_1 - \mu_2 = 0$$

where $c_1 = 1$, $c_2 = -1$, and $c_3 = 0$.

287.  On  the  other  hand,  suppose  the  researcher  wants  to  know  if 
the  average  of  the  means  for  the  first  two  coves  is  significantly  dif¬ 
ferent  from  the  mean  for  the  third  cove. 

288. Then the hypothesis would be

$$H_0: \frac{\mu_1 + \mu_2}{2} - \mu_3 = 0$$

where $c_1 = 1/2$, $c_2 = 1/2$, and $c_3 = -1$.

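289. To test these hypotheses, an F test statistic may be defined
as follows:

$$F = \frac{\hat{L}^2}{MS_e \sum_{i=1}^{a} c_i^2 / n_i}$$

where $\hat{L} = \sum_{i=1}^{a} c_i \bar{x}_i$, $MS_e$ is the error mean
square, and $n_i$ is the number of observations in the $i$th sample.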

290. Referring to Tables 17 and 19, and the latter hypothesis,

$$F = \frac{[1/2(29.0) + 1/2(31.4) - 22.0]^2}{2.69\left[\dfrac{(1/2)^2}{4} + \dfrac{(1/2)^2}{5} + \dfrac{(-1)^2}{3}\right]} = \frac{67.24}{1.20} = 56.03$$

291. The question is whether the F test statistic is significant.
It is here that many methods are available which differ according to
control of the Type I error ($\alpha$) rate so that when several
contrasts are made, the desired $\alpha$-level is maintained. Two
methods are Bonferroni's and Scheffe's. The Bonferroni method defines
the critical F value for $\alpha/k$, where $k$ is the number of
contrasts, with degrees of freedom equal to 1 and $df_e$. For instance
(see Tables 17 and 19), if $\alpha = 0.05$ and five contrasts are made,
$F = 10.6$ for $\alpha/k = 0.01$, $df = 1$, and $df_e = 9$. The Scheffe
method uses $F^* = (a-1)F$ where F is defined for $\alpha$ and degrees
of freedom equal to $a-1$ and $df_e$. For instance (see Tables 17 and
19), if $\alpha = 0.05$ and five contrasts are made, then
$F^* = 2(4.26) = 8.52$ where 4.26 is F for $\alpha = 0.05$ and degrees
of freedom equal to 2 and 9.
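
The arithmetic of paragraphs 290 and 291 can be sketched in a few lines of Python. The sketch assumes the contrast F takes the form implied by the worked numbers in paragraph 290 (squared contrast estimate divided by the error mean square times the sum of $c_i^2/n_i$); the function and variable names are illustrative only, and the critical values are the tabulated ones quoted in the text rather than computed.

```python
# Cove means, sample sizes, and error mean square as quoted in paragraph 290.
means = [29.0, 31.4, 22.0]
sizes = [4, 5, 3]
ms_error = 2.69
coeffs = [0.5, 0.5, -1.0]  # average of coves 1 and 2 versus cove 3


def contrast_f(means, sizes, coeffs, ms_error):
    """Return the F statistic for the contrast sum(c_i * mean_i)."""
    estimate = sum(c * m for c, m in zip(coeffs, means))
    scale = sum(c * c / n for c, n in zip(coeffs, sizes))
    return estimate ** 2 / (ms_error * scale)


f_stat = contrast_f(means, sizes, coeffs, ms_error)

# Tabulated critical values quoted in paragraph 291 (not computed here).
f_bonferroni = 10.6   # F for alpha/k = 0.01 with 1 and 9 degrees of freedom
f_scheffe = 2 * 4.26  # (a - 1) * F for alpha = 0.05 with 2 and 9 degrees of freedom

print(f"F = {f_stat:.2f}")
print("Bonferroni:", "significant" if f_stat > f_bonferroni else "not significant")
print("Scheffe:", "significant" if f_stat > f_scheffe else "not significant")
```

Carried without intermediate rounding, the statistic is about 56.07; the 56.03 in the text reflects rounding the denominator to 1.20. Either way the contrast exceeds both critical values.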

292. As a final note, the Bonferroni method is the preferred method
for a priori planned comparisons and when the number of contrasts is
less than or equal to the number of means. As the number of contrasts
increases, and if the contrasts are mainly a matter of post hoc
searches for differences among means, the Scheffe method would be
preferable.

293. Before leaving the topic of contrasts, we will address a
special set of contrasts, most often called orthogonal polynomials,
which are particularly useful for trend analyses. These contrasts apply
when the treatments represent equally spaced quantitative levels such as
time or distance. For instance, suppose the levels of the
treatment-populations represent 4 months. The research question might
be, "Across months, what is the nature of the trend in the means?"
Because there are 4 months, the trend might be linear, quadratic, cubic,
or some combination of linear, quadratic, and cubic. Multipliers for
polynomial contrasts for up to four treatment levels are given in
Table 27.
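
A minimal sketch of a trend analysis with orthogonal polynomial multipliers follows. The multipliers are the standard integer coefficients for four equally spaced levels; the monthly means are hypothetical values chosen to lie on a straight line, and the names `POLY_4`, `dot`, and `monthly_means` are illustrative only.

```python
from itertools import combinations

# Standard orthogonal polynomial multipliers for four equally spaced levels.
POLY_4 = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def dot(u, v):
    """Sum of elementwise products of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Each set of multipliers sums to zero, and every pair is orthogonal.
sums_to_zero = all(sum(c) == 0 for c in POLY_4.values())
orthogonal = all(dot(POLY_4[a], POLY_4[b]) == 0 for a, b in combinations(POLY_4, 2))

# Hypothetical monthly means that increase by a constant amount each month.
monthly_means = [10.0, 12.0, 14.0, 16.0]
trend = {name: dot(c, monthly_means) for name, c in POLY_4.items()}
print(trend)
```

For these purely linear means only the linear contrast is nonzero. With real data, each contrast estimate would be tested with the contrast F statistic described in paragraph 289.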


Nonparametric  Analyses 


294. The motivation underlying the use of nonparametric techniques
is quite simple and direct. Nonparametric (NP) methods rely generally
on a set of assumptions that is less stringent than the set required by
the analyses presented in the previous chapters. Basic to all
previously discussed ANOVA was a set of assumptions about the
underlying distribution of the treatment-populations. The restrictions
were that the dependent variable (usually stated in terms of residual
effects) comes from underlying continuous distributions which are normal
with equal variances. At face value these constraints appear quite
restrictive. Many would argue that those conditions are rarely
satisfied. For instance, if there are in fact no mean-differences among
the treatment-populations, then the significance level ($\alpha$ = Type I
error rate) represents the probability of declaring mean-differences
based simply on chance or the random sampling process (that is,
$100\alpha$ percent of the time significant differences will be detected
merely by chance). However, the Type I error rate statement is true if
and only if the assumptions of the model are, in fact, satisfied. The
question thereby becomes, "If one or more of the conditions of the model
are not satisfied, are alternative methods available which maintain
'correctness' in the probability inferences?" The answer is a
resounding "yes."

295. Nonparametric analyses of variance yield valid inferences
about treatment-population differences yet rely upon satisfying a less
stringent set of conditions. However, as is usually the case in the
world, one does not get something for nothing. Depending upon the
degree of mildness in the constraints imposed for NP methods, the loss
incurred may lie in (a) less specificity in the precise nature of the
differences among the treatment-populations that the methods are
sensitive to or (b) less power in detecting differences which do in fact
exist. Parenthetically, in the latter case, the loss in power is
relative to the particular methods used but more importantly is relative
to the degree of violation of the assumptions of the parametric
analyses. While some NP techniques incur minimal loss in power
efficiency, others maintain equal or greater power efficiency depending
upon the degree of violation of the parametric assumptions.

[Table 27. Multipliers for Polynomial Contrasts for up to Four
Treatment Levels]

296. The fundamental point of departure from parametric methods
lies in the assumed form of the underlying treatment-population
distributions. Whereas a parametric ANOVA requires normal populations,
NP analyses do not. In fact, an NP ANOVA does not even require
similarity of distribution form across treatment-populations.
Similarly, whereas a parametric ANOVA requires the normal populations to
possess equal spread (variance), an NP ANOVA does not. Hence, NP methods
are often called "distribution-freer" methods. Some clarification of
the term "freer" is warranted. The use of "distribution-freer" does not
say that no distributional assumptions are made. Quite the contrary.
The terminology simply implies that less restrictive conditions are
required. For instance, NP methods share with parametric methods two
assumptions: the residual effects are continuous and distributed
independently.

297. Beyond these two basic assumptions of continuity and
independence, NP methods vary in their assumptions, and this variation
is reflected in the precise meaning of the estimation and hypothesis
testing processes. NP analysis of variance techniques are, in a strict
sense, sensitive to differences between treatment-populations other than
those reflected by mean-separation. However, this sensitivity to
differences in shape or dispersion (spread) is minimal. Moreover, if
assumptions such as equivalence of form (not necessarily normal) and/or
spread are valid, NP methods test for and estimate mean-differences.
For instance, some NP techniques require in addition to the two basic
assumptions of continuity and independence only the assumption that
treatment-population distributions are symmetrical regardless of their
form or spread.

298. The mechanics of NP ANOVA are conceptually easy to comprehend.
The estimation and test processes are based on transforming the
original data by ranking the data. Hence, NP analyses are oftentimes
referred to as "analyses of variance by ranks." The meaning of the
procedures is simple. Suppose we desire to compare three
treatment-populations and suppose a random sample of five experimental
units is selected from each. Let us rank the observations in ascending
order irrespective of the treatment-population yielding the data. As a
result and under the presumption of no ties, the ranks will range from
1 to 15 in units of 1. Now if no difference between
treatment-populations exists, one would expect the average of the ranks
from each sample to be equal. Conversely, if substantive differences
exist (e.g., Population A > Population C > Population B), the means of
the sample ranks should reflect the underlying inequalities. Therein
lies the meaning of NP techniques.
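
The ranking step described in paragraph 298 can be sketched as follows. The three samples are hypothetical values constructed so that Population A > Population C > Population B, and `average_ranks` is an illustrative helper that gives tied values the mean of the ranks they occupy (a common convention, assumed here).

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical samples of five experimental units from each treatment-population.
samples = {
    "A": [8.1, 9.4, 7.7, 10.2, 9.0],
    "B": [3.2, 4.1, 2.8, 5.0, 3.9],
    "C": [6.0, 5.5, 6.8, 7.1, 6.3],
}

# Rank all 15 observations together, ignoring which population produced them.
pooled = [value for obs in samples.values() for value in obs]
ranks = average_ranks(pooled)

mean_rank = {}
start = 0
for name, obs in samples.items():
    mean_rank[name] = sum(ranks[start:start + len(obs)]) / len(obs)
    start += len(obs)

print(mean_rank)
```

Because the hypothetical samples do not overlap, the mean ranks separate completely (13, 3, and 8 for A, B, and C); under no treatment-population difference each would be expected to be near 8.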

299.  Three  nonparametric  analyses  will  be  illustrated:  one-way 
ANOVA  by  ranks  for  a  completely  randomized  design  (Kruskal-Wallis  test), 
two-way  ANOVA  by  ranks  for  a  randomized  block  design  (Friedman  test), 
and  correlation  (Spearman's  rho) . 


300. The Kruskal-Wallis test for a one-way ANOVA by ranks is
based on the H statistic, calculated as

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$

where

$n_i$ = number of observations in treatment-population $i$
$k$ = number of treatment-populations
$N = \sum_{i=1}^{k} n_i$ = total number of observations
$R_i$ = sum of the ranks of the $n_i$ observations in
treatment-population $i$

301. The calculated H statistic is compared to a chi-square
($\chi^2$) value with degrees of freedom equal to the number of
treatment-populations minus one (DF = k - 1). If the calculated H
statistic exceeds the chi-square value, the null hypothesis can be
rejected.

302.  As  an  example  of  a  one-way  ANOVA  by  ranks,  consider  the  data 
presented  in  Table  28.  Twelve  samples  for  chlorophyll  a  were  taken  from 
each  of  three  reservoirs,  and  the  data  ranked  in  ascending  order.  In 
nonparametric  tests,  population  parameters  are  not  used  in  statements  of 
hypotheses,  so  the  hypotheses  in  this  example  are  stated  as 

$H_0$: chlorophyll concentration is the same in all three
reservoirs

$H_a$: chlorophyll concentration is not the same in all three
reservoirs

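303. The statistics necessary to calculate H are

$$N = 36, \quad n_1 = n_2 = n_3 = 12, \quad R_1 = 206, \quad R_2 = 163, \quad R_3 = 297$$

and

$$H = \frac{12}{36(37)}\left[\frac{(206)^2 + (163)^2 + (297)^2}{12}\right] - 3(37) = 118.029 - 111.0 = 7.029$$

The critical value of $\chi^2$ with DF = 2 and $\alpha = 0.05$ is 5.991,
so the null hypothesis of equal chlorophyll concentration can be
rejected.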
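
As a check on the arithmetic, the H statistic can be recomputed from the rank sums reported for the three reservoirs (paragraph 306) and the twelve samples per reservoir (paragraph 302). The sketch assumes the standard Kruskal-Wallis form of H; the function name is illustrative, and the critical value is the tabulated one quoted in the text.

```python
def kruskal_wallis_h(rank_sums, sizes):
    """Kruskal-Wallis H from per-sample rank sums and sample sizes (no tie correction)."""
    total = sum(sizes)
    weighted = sum(r * r / n for r, n in zip(rank_sums, sizes))
    return 12.0 / (total * (total + 1)) * weighted - 3 * (total + 1)


rank_sums = [206, 163, 297]  # reservoirs 1, 2, and 3
sizes = [12, 12, 12]

# The rank sums must add up to N(N + 1)/2, which guards against ranking slips.
n_total = sum(sizes)
ranks_consistent = sum(rank_sums) == n_total * (n_total + 1) // 2

h_stat = kruskal_wallis_h(rank_sums, sizes)
CHI2_CRITICAL = 5.991  # tabulated chi-square, DF = 2, alpha = 0.05

print(round(h_stat, 3), h_stat > CHI2_CRITICAL)
```

The sketch applies no correction for tied ranks, which matches the calculation shown in the text.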

304. Just as with the parametric one-way ANOVA, the nonparametric
one-way ANOVA indicates only whether significant differences exist.
Rejection of the null hypothesis by the one-way NP ANOVA does not
indicate which of the treatment-populations are different.
Nonparametric multiple comparisons can be performed in a manner similar
to the Student-Newman-Keuls test by using rank sums instead of means.

305. In order to perform the multiple comparisons, the rank sums
from the Kruskal-Wallis test are arranged in increasing order of
magnitude. Pairwise differences between rank sums are then computed.
The standard error is calculated as

$$SE = \sqrt{\frac{n(np)(np + 1)}{12}}$$

where

$n$ = number of observations in each treatment-population
$p$ = range of rank sums (the number of ordered rank sums spanned by
the comparison)

Note that this multiple range test requires that there be equal
numbers (n) of data in each of the treatment-populations.

306. Using the example of chlorophyll a concentrations from three
reservoirs, the rank sums can be ordered

Rank of rank sums ($RR_i$)      1         2         3
Rank sum ($R_i$)               163       206       297
                             (Res 2)   (Res 1)   (Res 3)

and the multiple comparisons are given in Table 29. Based on the
multiple comparisons it can be concluded that chlorophyll concentration
in reservoir 3 is greater than that in reservoirs 1 and 2 and that
chlorophyll concentration is the same in reservoirs 1 and 2. These
conclusions can be summarized by

 163      206      297
 ______________
Res 2    Res 1    Res 3

Table 29

Comparison           Difference
($R_j$ vs. $R_i$)    ($R_j - R_i$)     SE        q      $q_{0.05,\infty,p}$
3 vs. 1                 134          36.479    3.673     3.31
3 vs. 2                  91          24.495    3.715     2.77
2 vs. 1                  43          24.495    1.755     2.77
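
The multiple comparisons among rank sums can be sketched as below. Two assumptions are made: the standard error takes the form $\sqrt{n(np)(np+1)/12}$, which reproduces the tabulated 24.495 for $p = 2$, and each difference is divided by its standard error and compared with the tabulated Studentized range value $q_{0.05,\infty,p}$ (3.314 for $p = 3$, 2.772 for $p = 2$). All names are illustrative.

```python
import math

# Rank sums in ascending order, labelled by reservoir.
rank_sums = {"Res 2": 163, "Res 1": 206, "Res 3": 297}
n = 12  # observations per reservoir

# Tabulated Studentized range critical values q(0.05, infinity, p).
Q_CRITICAL = {2: 2.772, 3: 3.314}


def standard_error(n, p):
    """Standard error for a comparison spanning p ordered rank sums."""
    return math.sqrt(n * (n * p) * (n * p + 1) / 12)


names = list(rank_sums)
results = []
# Compare the largest rank sum with the smallest first, then work inward.
for high in range(len(names) - 1, 0, -1):
    for low in range(high):
        p = high - low + 1
        difference = rank_sums[names[high]] - rank_sums[names[low]]
        q = difference / standard_error(n, p)
        results.append((names[high], names[low], p, round(q, 3), q > Q_CRITICAL[p]))

for row in results:
    print(row)
```

Under these assumptions the standard error for $p = 3$ evaluates to about 36.497, slightly different from the tabulated 36.479, but the conclusions are unchanged: reservoir 3 differs from reservoirs 1 and 2, which do not differ from each other.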

307. Friedman's test is a nonparametric method that can be applied
on a randomized block design. Remember that a randomized block design
consists of b blocks and t treatments. To perform Friedman's test, the
data within each block are ranked (i.e., values are ranked with respect
to members of the given block) and then the ranks are summed for each
treatment. The test statistic $\chi_r^2$ is calculated as

$$\chi_r^2 = \frac{12}{b\,t\,n^2(nt + 1)} \left[\sum_{i=1}^{t} R_i^2 - \frac{\left(\sum_{i=1}^{t} R_i\right)^2}{t}\right]$$

where

$b$ = number of blocks
$t$ = number of treatments
$n$ = number of observations per cell
$R_i$ = rank sum for the $i$th treatment

The calculated $\chi_r^2$ is compared to a critical $\chi^2$ value with
degrees of freedom equal to the number of treatments minus one
(DF = t - 1). If the calculated $\chi_r^2$ exceeds the critical
$\chi^2$, the null hypothesis can be rejected.


Table  30 

Data  for  Two-Way  ANOVA  by  Ranks 


308. An example of a two-way ANOVA by ranks for a randomized
block design can be based on the data in Table 30. Soluble reactive
phosphorus (SRP) concentrations were measured in four replicate samples
from each of four stations sampled in June, July, and August. In this
example, stations are the blocks and months represent the
treatment-populations.

309. Note that the ranking process is applied within each block
(station) and separately for each block. Once the ranks are assigned,
the rank sums can be calculated by summing the ranks within each
treatment-population (month). The rank sums are

June      $R_1$ = 166.0
July      $R_2$ = 92.5
August    $R_3$ = 53.5

Also, for this example,

b = 4
n = 4
t = 3

310. The null hypothesis to be tested is

$H_0$: SRP concentration is the same for June, July, and August

with the alternative hypothesis

$H_a$: SRP concentration is not the same for June, July, and August

311. The calculated $\chi_r^2$ statistic is

$$\chi_r^2 = \frac{12}{(4)(3)(4)^2[(4)(3) + 1]} \left[(166.0)^2 + (92.5)^2 + (53.5)^2 - \frac{(166.0 + 92.5 + 53.5)^2}{3}\right]$$

$$\chi_r^2 = \frac{12}{2{,}496}\left[38{,}974.5 - \frac{97{,}344.0}{3}\right]$$

$$\chi_r^2 = (0.00481)(6{,}526.5) = 31.39$$

The critical value of $\chi^2$ with DF = 2 and $\alpha = 0.05$ is 5.991,
so the null hypothesis can be rejected.
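
The Friedman calculation can be reproduced from the rank sums alone. The sketch assumes the replicated form of the statistic implied by the arithmetic in paragraph 311, with b blocks, t treatments, and n observations per cell; the function and argument names are illustrative.

```python
def friedman_chi_square(rank_sums, blocks, per_cell):
    """Friedman statistic from treatment rank sums for a replicated block design."""
    t = len(rank_sums)
    scale = 12.0 / (blocks * t * per_cell ** 2 * (per_cell * t + 1))
    spread = sum(r * r for r in rank_sums) - sum(rank_sums) ** 2 / t
    return scale * spread


rank_sums = [166.0, 92.5, 53.5]  # June, July, August
blocks, per_cell = 4, 4

# Within each block the n*t observations receive ranks 1..n*t, so the rank
# sums over all treatments must total blocks * m(m + 1)/2 with m = n*t.
m = per_cell * len(rank_sums)
ranks_consistent = sum(rank_sums) == blocks * m * (m + 1) / 2

chi_r2 = friedman_chi_square(rank_sums, blocks, per_cell)
CHI2_CRITICAL = 5.991  # tabulated chi-square, DF = 2, alpha = 0.05

print(round(chi_r2, 2), chi_r2 > CHI2_CRITICAL)
```

Without intermediate rounding the statistic is about 31.38; the 31.39 in the text arises from rounding the multiplier to 0.00481. With one observation per cell (n = 1) the expression reduces to the usual unreplicated Friedman statistic.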

312.  Nonparametric  or  Spearman's  correlation  is  simple  and 
direct.  This  method  of  correlation  is  useful  when  the  bivariate  data 
are  not  normally  distributed.  To  perform  a  rank  correlation,  simply 
rank  separately  the  x  and  y  data  and  compute  the  correlation  coefficient 
as  shown  for  a  simple  linear  correlation  using  the  ranks  rather  than  the 
raw  data. 

313. The data in Table 31 can be used for an example of rank
correlation. Assume that these flow and concentration data were taken
at the major inflow to a reservoir and then ranked in ascending order.
To calculate the Spearman rank correlation coefficient $r_s$, it is
first necessary to compute

$$\sum x^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}, \quad \sum y^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}, \quad \sum xy = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$$

where

$X_i$ = rank of the X value
$Y_i$ = rank of the Y value
$n$ = number of bivariate pairs

Then

$$r_s = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$

and for the example

$$r_s = \frac{61.5}{\sqrt{(82.5)(82.5)}} = 0.745$$


314.  To  determine  the  significance  of  the  rank  correlation,  a 
t  test  is  used  in  the  same  manner  as  was  used  for  the  simple  linear 
correlation  coefficient. 
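
A minimal sketch of rank correlation and its t test follows. The flow and concentration values are hypothetical stand-ins, not the Table 31 data. The t statistic is assumed to take the usual form for a correlation coefficient, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom, and the example value $n = 10$ is inferred from the sum of squares 82.5, which equals $n(n^2-1)/12$ for ten untied ranks.

```python
import math


def ranks_no_ties(values):
    """Ascending ranks 1..n; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rs(x, y):
    """Simple linear correlation coefficient computed on the ranks of x and y."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    n = len(rx)
    sxy = sum(a * b for a, b in zip(rx, ry)) - sum(rx) * sum(ry) / n
    sxx = sum(a * a for a in rx) - sum(rx) ** 2 / n
    syy = sum(b * b for b in ry) - sum(ry) ** 2 / n
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for a correlation coefficient r based on n pairs."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical inflow data (flow and constituent concentration).
flow = [12.0, 30.5, 8.2, 45.1, 20.3, 15.7]
conc = [0.9, 1.1, 0.5, 2.0, 1.4, 1.6]
rs_hypothetical = spearman_rs(flow, conc)

# Coefficient from the sums quoted in paragraph 313, with n = 10 inferred.
rs_example = 61.5 / math.sqrt(82.5 * 82.5)
t_example = t_statistic(rs_example, 10)

print(round(rs_hypothetical, 3), round(rs_example, 3), round(t_example, 2))
```

The resulting t would be compared with a tabulated t value for $n - 2 = 8$ degrees of freedom.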


Multivariate  Data  Analysis 

315. Many studies of reservoir water quality involve multiple
variables, multiple samples, and/or multiple water bodies. In those
situations, statistical studies will be both univariate (and involve
methods described elsewhere in this manual) and multivariate. The
multivariate methods described below can be used to greatly enhance the
limnologist's understanding of water quality relationships in and among
reservoirs. Equally important, computer programs (like SAS) with
multivariate methods are available that may be used as easily as are
their univariate counterparts. As with the application of all
statistical methods, however, the use of multivariate methods must occur
with consideration of the assumptions behind their inferential use.

Issues  that  can  be  addressed 
with  multivariate  analysis 

316.  The  types  of  research  questions  that  can  be  examined  with 
multivariate  statistical  methods  can  be  conveniently  grouped  into  a 
relatively  few  categories.  For  the  methods  discussed  herein,  these 
categories  are: 

a. Characterization of the strength of a relationship
between and/or among variables (multiple and canonical
correlation).

b. Classification of variables or observations (cluster
analysis).

c. Examination of structure within a system (principal
component and factor analysis).

d. Development of predictive relationships for assignment
to predefined groups (discriminant analysis).

Each of these tasks or functions for multivariate methods is described
briefly below.

317. The "strength" of a bivariate relationship is often
conveniently expressed in terms of a simple correlation coefficient.
Multivariate analogs to the simple correlation coefficient exist for two
situations. First, multiple correlation is used to describe the degree
of relationship between a single dependent variable and a combination of
two or more independent variables. This is the situation that occurs in
multiple regression, and therefore the multiple correlation coefficient
is a useful indicator of the goodness of fit of the multiple regression
model. In truth, the multiple correlation coefficient is just a simple
correlation coefficient with a new variable created that is a function
of two or more original variables. The second multivariate analog for
simple correlation is canonical correlation. In effect, this describes
the situation with linear combinations of multiple dependent and
multiple independent variables. Thus, the canonical correlation
coefficient may be used to describe the strength of a relationship
between a linear combination of nutrient variables and a linear
combination of biomass-related variables. The linear combinations are
defined by the canonical correlation procedure in order to maximize the
degree of correlation between these two sets of variables.

318. The multivariate procedure called cluster analysis may be
used to take "objects" and group them into categories that are based on
the relative similarity of the objects as expressed in a set of
prespecified variables. For example, with trophic state data (on several
variables) for a number of reservoirs, the limnologist can use cluster
analysis to create groups of reservoirs (based on these trophic state
variables) that may then be labeled as specific trophic state categories
(e.g., oligotrophic, eutrophic, etc.). Alternatively, the objects may
be sampling stations within a reservoir, and cluster analysis may be
used to group the stations according to similarity in sampling results.
In that manner, cluster analysis may be used to identify redundant
stations if sampling effort is to be reduced.

319.  It  is  not  uncommon  that  data  acquired  on  multiple  variables 
actually  represent  one  or  only  a  few  fundamental  characteristics.  For 
example,  multiple  nutrient  and  biomass  data  can  all  be  considered  to 
represent  the  single  concept  trophic  state.  The  companion  procedures, 
principal  components  and  factor  analysis,  can  be  used  to  extract  this 
simple  structure  (if  it  exists  at  all)  from  a  multivariate  data  set.  In 
other  words,  these  procedures  may  be  used  to  define  a  linear  function  of 
the  variables  which  represents  the  "common  element"  in  the  data.  For 
this  example,  the  univariate  composite  that  results  might  be  called  a 
trophic  state  index. 

320. Cluster analysis, as noted above, is used to create groupings
of observations on the basis of the similarity of observations as
represented by a set of multivariate data. No groups are defined a
priori. Discriminant analysis, on the other hand, is based on
predefined groups of observations. With group membership established
beforehand, discriminant analysis can be employed to define a linear
function of independent variables that may be used to predict group
membership for a new observation. For example, reservoirs could be
preassigned to trophic state classes on the basis of existing biomass
and water clarity data. Discriminant analysis may then be used to
develop a linear model, perhaps based on nutrient loading data from
these reservoirs, that can be applied to predict the trophic state of a
previously unclassified reservoir (from the nutrient loading estimates).

Important  assumptions  for 
multivariate  statistical  inference 

321.  There  are  certain  assumptions,  and  in  a  more  general  sense, 
certain  conditions,  that  should  be  met  or  at  least  considered  when 
applying  multivariate  methods.  This  requirement  is  not  different  from 
similar  requirements  for  univariate  statistical  analysis.  In  fact,  the 
specific  conditions  or  assumptions  that  are  important  are  essentially 
the  same  as  those  noted  to  be  of  concern  in  previous  sections  devoted  to 
nonmultivariate statistical methods. For additional information on this
topic, Tabachnick and Fidell (1983) is recommended.


322.  A  "condition"  of  a  data  set  (that  is  not  an  assumption  as 
such,  but  can  affect  most  of  the  assumptions  discussed  below)  concerns 
outliers  or  influential  data  points.  As  noted  in  Part  III  concerning 
descriptive  statistics,  observations  that  are  far  removed  from  the  bulk 
of  the  data  points  (outliers)  can  have  a  major  effect  on  the  value  of 
commonly  calculated  statistics,  such  as  the  mean,  standard  deviation, 
and  variance.  Since  these  statistics  are  often  used  in  multivariate 
procedures,  outliers  can  affect  the  results  of  multivariate  analysis. 

One  approach  (Gnanadesikan  1977)  is  to  use  robust  analogs  of  the  mean 
and  variance;  however,  this  will  affect  inferential  statements  (e.g., 
significance  tests)  and  it  is  not  clear  what  adjustments  need  to  be  made 
to  employ  statistical  tests  with  the  robust  statistics.  A  better 
approach  may  be  to  carefully  screen  the  data  and  apply  a  transformation 
if  necessary.  In  essence,  the  methods  recommended  in  Parts  II  and  III 
may  be  used  to  do  this  screening.  This  can  be  undertaken  for  each 
variable  individually,  and  in  most  cases  this  will  serve  to  identify  all 
multivariate  outliers. 

323. Collectively, the key assumptions for the multivariate
methods concern normality, independence of observations, constant
variance (homoscedasticity), and linearity. It is important to realize
that the assumptions do not hold for all methods nor do they necessarily
hold for all applications of the same method. Further, it is likely
that even when the assumptions are necessary, mild violations of an
assumption (with the possible exception of the independence assumption)
are of little consequence.

324. The assumption of normality is of concern when hypothesis
tests, significance levels, or confidence intervals are determined
because these procedures require normality. Although the assumption may
refer to multivariate normality, it is often adequate to simply assess
the normality assumption on each variable individually and apply
normalizing transformations where necessary. While univariate normality
does not guarantee multivariate normality, it will probably be adequate
for most applications.


325. Independence of observations is an important assumption
whenever statistical tests that are a function of sample size are
applied. The problem occurs because lack of independence means that the
effective sample size (based on the amount of nonredundant information
in the data set) is less than the actual sample size. Therefore, to
properly conduct statistical tests with dependent observations, an
effective sample size should be calculated for use in testing
procedures. Alternatively, some observations could be eliminated from
the data set such that the remaining observations are independent. For
example, if the data set consisted of weekly dependent observations, but
it was determined that biweekly observations were independent, then one
simple (but perhaps inefficient) solution is to eliminate every other
observation and conduct statistical tests on biweekly data. Of all the
assumptions listed above, the independence assumption is most critical
on the basis of the consequences of violation.

326. The assumptions of linearity and homogeneity of variance at
times are necessary for statistical tests (statistical inference) and
more commonly are important as "conditions" that affect interpretation
in descriptive use of multivariate methods. When one or both of these
conditions is a problem, the result is that the correlation matrix (or
covariance matrix) for the multiple variables does not correctly or
adequately represent relationships. For example, if a relationship
between two variables is nonlinear, but the multivariate analysis is run
for a linear model between the variables (i.e., a linearizing
transformation was not applied beforehand), then the result will not
reflect the true (nonlinear) relationship. This in turn can affect the
conclusions drawn by the investigator. It is a good idea, therefore, to
examine the data in univariate and bivariate plots (see Part II) to
check for linearity and homoscedasticity. Corrections (e.g., a
linearizing or variance stabilizing transformation) made on the basis of
univariate and bivariate examination of the data should usually satisfy
the multivariate assumptions and conditions.

327. Characterization of relationship strength: canonical
correlation. As noted above, the strength of a multivariate relationship
can be assessed using either multiple or canonical correlation. Since
multiple correlation is almost always associated with multiple
regression, the reader interested in multiple correlation is referred to
the section on regression analysis. The discussion in this section
focuses on canonical correlation; useful references on this topic
include Tabachnick and Fidell (1983) and Green (1978).

328.  Canonical  correlation  is  used  to  identify  and  estimate  a 
linear  function  (called  a  canonical  variate)  of  one  set  of  variables 
that  is  maximally  correlated  with  a  linear  function  of  a  second  set  of 
variables.  Additional  canonical  variates,  which  are  uncorrelated  with 
the  first  set,  may  also  be  identified.  The  procedure  results  in  infor¬ 
mation  that  is  primarily  descriptive  in  nature,  and  thus  it  has  been 
used  less  frequently  than  have  other  multivariate  methods  that  facili¬ 
tate  hypothesis  testing  and/or  prediction. 

329. Generally of interest to those who apply canonical correlation
is the extent of relationship between two sets of descriptors (or
variables)  for  objects  (e.g.,  reservoirs)  under  study.  For  example,  one 
may  be  interested  in  the  relationship  (if  any  exists)  between  reservoir 
water  chemistry  and  cell  counts  for  dominant  algal  species  to  see  if 
certain  conditions  (e.g.,  low  inorganic  nitrogen  concentration)  favor 
(covary  with)  blue-green  dominance.  The  canonical  correlation  between  a 
set  of  water  chemistry  variables  (nutrients,  etc.)  and  a  set  of  vari¬ 
ables  indicating  cell  counts  for  major  algal  species  in  a  multireservoir 
study  could  be  quite  helpful. 

330.  Canonical  correlation  may  also  be  used  to  see  how  many 
"common  elements"  are  contained  within  two  sets  of  variables.  For 
example,  canonical  correlation  may  be  applied  to  a  set  of  reservoir 
water  chemistry  variables  and  a  set  of  reservoir  geomorphology  and 
hydrology  variables.  From  this  analysis,  the  first  canonical  variate 
might  represent  trophic  state  as  determined  from  the  canonical  weights 
or  coefficients,  which  in  that  case  might  be  highest  for  nutrients  in 
the  first  set  of  variables  and  for  depth  in  the  second  set  of  variables. 
In  addition,  the  second  canonical  variate  could  represent  reservoirs 
located primarily in the southern United States with the largest
canonical weights on variables such as conductivity, alkalinity, and
reservoir volume.

331.  When  two  or  more  pairs  of  canonical  variates  are  identified, 
the  investigator  can  express  the  relative  importance  of  each  canonical 
variate  pair  on  the  basis  of  the  percent  overlapping  variance  (equal  to 
the  squared  canonical  correlation  coefficient)  between  the  two  sets  of 
original  variables.  For  the  hypothetical  example  discussed  above,  it 
may  be  found  that  the  first  canonical  variate  pair  represents  60  percent 
overlapping  variance  and  the  second  canonical  variate  pair  has  15  per¬ 
cent  overlap.  This  information  helps  the  investigator  understand  the 
extent  of  commonality  within  a  set  of  variables.  In  addition,  one  can 
calculate  the  percent  variance  explained  by  a  canonical  variate  within 
each  of  the  two  sets  of  original  variables.  For  example,  in  the  pre¬ 
vious  hypothetical  example,  the  investigator  may  find  that  the  first 
canonical  variate  explains  70  percent  of  the  variance  in  the  nutrient 
variables  and  30  percent  of  the  variance  in  the  hydrology-geomorphology 
variables. 

332.  When  canonical  correlation  is  used  in  hypothesis  testing  or 
to  justify  statements  of  statistical  significance,  an  assumption  of 
multivariate  normality  is  necessary.  As  noted  in  the  previous  section, 
this  is  often  adequately  satisfied  by  creating  data  distributions  that 
are  approximately  univariate  normal.  Most  applications  of  canonical 
correlation  are  descriptive;  in  that  case,  multivariate  normality  is 
desirable  but  not  necessary.  Descriptive  applications  can  yield  mis¬ 
leading  information,  however,  if  the  data  distributions  are  highly 
skewed  or  exhibit  outliers.  It  is  wise,  therefore,  to  use  transforma¬ 
tions  if  necessary  to  create  roughly  symmetric  univariate  data  distri¬ 
butions,  and  to  carefully  examine  the  validity  of  any  outlying  data 
points.  If  outliers  cannot  be  removed  from  the  data  set  on  the  basis  of 
substantive  reasons,  then  it  is  recommended  that  two  analyses  be  run — 
one  with  the  outliers  and  one  with  the  outliers  excluded.  Both  analyses 
could  be  reported  so  the  reader  could  directly  relate  inclusion/ 
exclusion  of  the  outliers  to  the  canonical  correlations. 
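
The two-run comparison recommended above can be illustrated with the simple (product-moment) correlation coefficient, on which canonical correlation is built. The sketch below is not a canonical correlation analysis; the helper name `pearson_r` and the data values are hypothetical, chosen only to show how a single outlying point can dominate a correlation.

```python
import math


def pearson_r(x, y):
    """Simple (product-moment) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations; the last pair is a suspected outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]
y = [2.1, 1.9, 2.2, 2.0, 1.8, 2.1, 9.0]

r_with = pearson_r(x, y)                # analysis with the outlier
r_without = pearson_r(x[:-1], y[:-1])   # analysis with the outlier excluded

print("r with outlier:   ", round(r_with, 3))
print("r without outlier:", round(r_without, 3))
```

Reporting both values lets the reader see how much of the apparent relationship depends on the single outlying case.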


333.  Finally,  canonical  correlation  depends  upon  linear  relation¬ 
ships  at  two  points  in  the  analysis.  First,  the  variables  within  each 
of  the  two  groups  of  data  are  combined  in  linear  canonical  variates. 
Therefore,  the  data  should  be  transformed  if  necessary  so  that  a  linear 
model  is  appropriate.  Second,  since  the  analysis  is  based  on  the  simple 
(product-moment) correlation coefficient, bivariate correlation coefficients
should reflect the actual linear relationships in the data.
Again, transformations may be necessary to achieve this. For example,
if  the  relationship  between  Secchi  disk  depth  and  all  other  nutrient- 
related  variables  exhibits  a  hyperbolic  pattern,  and  the  inverse  of 
Secchi  disk  depth  straightens  (linearizes)  the  relationship,  then  the 
inverse  of  Secchi  disk  depth  should  be  used  in  the  canonical  correlation 
analysis. 
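
As a sketch of the Secchi disk example, the following Python compares the correlation before and after taking the inverse. The data are hypothetical and constructed to follow an exact hyperbola (Secchi depth = 40 / TP), so the inverse is perfectly linear in TP; real data would only approximate this. The helper `pearson_r` is an illustrative name, not part of any package mentioned in the text.

```python
import math


def pearson_r(x, y):
    """Simple (product-moment) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: total phosphorus and a Secchi depth that is
# exactly hyperbolic in phosphorus (assumption made for illustration).
tp = [10.0, 20.0, 40.0, 80.0, 160.0]
secchi = [40.0 / p for p in tp]

r_raw = pearson_r(tp, secchi)
r_inverse = pearson_r(tp, [1.0 / s for s in secchi])

print("r(TP, Secchi):  ", round(r_raw, 3))
print("r(TP, 1/Secchi):", round(r_inverse, 3))
```

The untransformed correlation understates the strength of the relationship because the product-moment coefficient measures only its linear part.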

334.  Canonical  correlation  -  example.  This  example  and  some  of 
the  other  examples  presented  in  this  section  on  multivariate  analysis 
are  based  on  a  data  set  from  Walker  (1981).  The  data,  presented  in 
Table  32,  concern  water  chemistry  in  43  Corps  of  Engineers  reservoirs. 

335.  The  data  set  consists  of  seven  variables.  Three  of  the 
variables  (pH,  alkalinity,  and  conductivity)  relate  to  acidity  and 
dissolved  salts.  Four  of  the  variables  (total  phosphorus,  total  nitro¬ 
gen,  Secchi  disk  depth,  and  chlorophyll  a)  are  related  to  trophic  state. 
For most of the studies, all variables but pH are log-transformed to
create  distributions  that  are  closer  to  univariate  normal  than  are  the 
distributions  of  the  untransformed  variables. 
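
A minimal sketch of this preprocessing step follows. The variable names, the record values, and the choice of base-10 logarithms are all assumptions made for illustration; the text does not state the base used, and the values below are not taken from Table 32.

```python
import math

# Variables that are log-transformed; pH is left untransformed, as in the text.
LOG_VARIABLES = ("alk", "cond", "tp", "tn", "secchi", "chla")


def transform_record(record):
    """Return a copy of the record with all variables but pH log-transformed."""
    out = dict(record)
    for name in LOG_VARIABLES:
        out[name] = math.log10(record[name])  # base 10 is an assumption
    return out


# Hypothetical reservoir record (illustrative values only).
reservoir = {"ph": 7.8, "alk": 100.0, "cond": 250.0,
             "tp": 40.0, "tn": 800.0, "secchi": 1.5, "chla": 10.0}

print(transform_record(reservoir))
```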

336.  Given  the  uses  of  canonical  correlation  and  the  composition 
of  the  sample  data  set,  it  seems  appropriate  to  use  canonical  correla¬ 
tion  to  examine  the  relationship  between  a  linear  function  of  the 
acidity-salinity  variables  and  a  linear  function  of  the  trophic  state 
variables.  This  was  done  using  the  SAS  CANCORR  procedure.  Some  of  the 
major  features  of  the  SAS  output  are  summarized  below. 

337.  For  this  example,  the  PROC  CANCORR  statement  simply  identi¬ 
fied  the  data  set  and  labeled  one  set  of  variables  (the  acidity-salinity 
variables) as "VAR" and the other set (the trophic state variables) as
"WITH" as required in the procedure. Table 33 presents a portion of the
SAS output: the canonical correlations, the approximate standard error,
the F statistic, and Pr > F.

338.  Since  there  are  three  variables  in  the  smaller  of  our  two 
groups  (the  VAR  group),  three  canonical  correlations  are  estimated 
(rows  1,  2,  and  3  in  the  upper  entry  of  Table  33).  Each  canonical 
variable  is  uncorrelated  with  all  other  canonical  variables  except  its 
corresponding  canonical  variate  from  the  opposite  data  set.  The  first 
pair  of  canonical  variables  (represented  by  row  1)  is  constructed  so 
that  it  maximizes  the  correlation  between  a  linear  combination  of  the 
VAR  variables  with  a  linear  combination  of  the  WITH  variables.  The 
second  pair  of  canonical  variables  is  also  constructed  to  maximize  this 
correlation,  except  that  it  must  also  be  uncorrelated  with  the  first 
pair  of  canonical  variates.  This  continues  until  all  canonical  variates 
are  estimated. 

339. In Table 33, the canonical correlations are given for the
three pairs of canonical variates. Note that the first two correlations
are reasonably high. Recall that the interpretation of the canonical
correlation coefficient squared is the percent overlapping variance.
Thus, for the first pair of canonical variates, there is about 54 percent
overlapping variance ($0.733^2$). This means that 54 percent of the
variance  in  the  first  VAR  canonical  variate  is  explained  by  the  first 
WITH  canonical  variate.  If  these  canonical  variables  are  in  turn  highly 
correlated  with  the  original  variables  (this  is  examined  below),  then 
the  first  canonical  variate  may  describe  a  high  level  of  commonality 
between  the  two  sets  of  variables. 
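
The percent overlapping variance for each pair of canonical variates is just the squared canonical correlation. A short Python sketch, using the three canonical correlations reported in Table 33 (the variable names are illustrative):

```python
# Canonical correlations for the three pairs of canonical variates (Table 33).
canonical_correlations = [0.733, 0.618, 0.317]

# Percent overlapping variance = 100 * (canonical correlation squared).
overlap_percent = [round(100 * r ** 2, 1) for r in canonical_correlations]

for pair, pct in enumerate(overlap_percent, start=1):
    print(f"pair {pair}: {pct} percent overlapping variance")
```

The first pair gives about 54 percent, as stated in the text; the second and third pairs give about 38 and 10 percent.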

340.  The  approximate  standard  errors  for  the  first  two  canonical 
correlations  are  relatively  small  in  comparison  to  the  magnitude  of 
these  correlations.  This  suggests  that  the  correlations  may  be  sig¬ 
nificantly  different  from  zero.  Confirmation  of  this  point  is  given  in 
the  F  statistic  and  the  probability  level  for  the  F  statistic  which 
shows significance at the 0.01 level. In these tests, the F statistic is
determined  for  the  canonical  correlation  in  its  row  as  well  as  for  all 
lower  canonical  correlations  simultaneously.  Thus,  the  F  statistic  in 
row 2 represents the simultaneous tests of canonical correlation
coefficients two and three. The interpretation of these tests is that
only the first two canonical correlation coefficients are significant at
the 0.01 level.

Table 33

Canonical  Canonical    Approximate                  Degrees of
Variate    Correlation  Standard Error  F Statistic  Freedom     Pr > F

1          0.733        0.071           5.336        12          0.000
2          0.618        0.095           4.202        6           0.001
3          0.317        0.139           2.118        2           0.134

Multivariate Test Statistics and F Approximations

Statistic               Value   F       Degrees of Freedom  Pr > F

Wilks' lambda           0.237   5.767   12                  < 0.001
Pillai's trace          1.100   5.502   12                  < 0.001
Hotelling-Lawley trace  1.978   5.715   12                  < 0.001
Roy's greatest root     1.161   11.034  4                   < 0.001
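
The row-by-row reading of Pr > F in Table 33 can be sketched as follows. Because each row tests its own canonical correlation together with all lower ones, the sketch stops at the first row that is not significant. The function name and the stopping rule as coded are illustrative assumptions, not SAS output.

```python
ALPHA = 0.01

# (canonical variate, Pr > F) from the upper part of Table 33.
sequential_tests = [(1, 0.000), (2, 0.001), (3, 0.134)]


def count_significant(tests, alpha):
    """Count leading rows whose simultaneous test is significant.

    Row k tests correlations k, k+1, ... together, so counting stops
    at the first row that fails to reach significance.
    """
    count = 0
    for _, p_value in tests:
        if p_value >= alpha:
            break
        count += 1
    return count


print(count_significant(sequential_tests, ALPHA))  # 2
```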

341.  The  next  grouping  of  the  output  in  Table  33  presents  four 
statistics  used  to  evaluate  the  significance  of  the  canonical  correla¬ 
tions  as  a  set.  While  each  statistic  is  slightly  different  from  the 
others,  they  all  test  this  feature  of  overall  significance.  Pillai's 
trace  may  be  the  most  robust  of  the  four  (Tabachnick  and  Fidell  1983), 
but  since  Wilks'  lambda  is  by  far  the  most  commonly  reported  of  the 
four,  it  is  the  one  we  discuss  here.  Since  tables  for  the  four  statis¬ 
tics  are  uncommon,  each  statistic  is  evaluated  for  significance  using  an 
F statistic approximation (F tables are far more common).


342.  Wilks'  lambda  is  defined  as 


Table 34

             Canonical Variable                       Canonical Variable
Variable     V1       V2       V3       Variable      W1       W2       W3

Raw Canonical Coefficients

pH          -1.478    3.846   -0.225    Log(TP)       2.822   -2.250   -3.427
Log(ALK)     4.402   -3.947   -1.318    Log(TN)       0.546   -0.986    4.015
Log(COND)   -0.268    1.003    2.769    Log(SECCHI)   0.706    0.180   -1.956
                                        Log(CHLa)     0.030    4.313    0.239

Standardized Canonical Coefficients

pH          -0.719    1.871   -0.110    Log(TP)       1.088   -0.867   -1.321
Log(ALK)     1.599   -1.433   -0.479    Log(TN)       0.146   -0.264   -1.076
Log(COND)   -0.108    0.404    1.117    Log(SECCHI)   0.232    0.059   -0.643
                                        Log(CHLa)     0.010    1.463    0.081

345.  For  interpretation  purposes,  however,  the  standardized 
canonical  coefficients  are  often  most  useful  as  they  provide  a  relative 
measure  of  the  importance  of  each  variable  in  determining  the  canonical 
variates. For this example, we can see that alkalinity (log(ALK), with
standardized canonical coefficient = 1.599) is the most important determinant
of V1, followed by pH (-0.719); conductivity (log(COND),
with -0.108) is least important. For W1, phosphorus concentration
(log(TP), with 1.088) is by far the most important variable. Thus, the
first canonical variable is, to a great extent, describing a relationship
between alkalinity and phosphorus that is not shared with the other
variables (except perhaps pH).
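
The ranking used in this interpretation can be reproduced by sorting the standardized canonical coefficients by absolute value. The sketch below uses the V1 and W1 columns of Table 34; the function name `rank_by_magnitude` is an illustrative choice.

```python
# Standardized canonical coefficients for V1 and W1 (Table 34).
v1 = {"pH": -0.719, "Log(ALK)": 1.599, "Log(COND)": -0.108}
w1 = {"Log(TP)": 1.088, "Log(TN)": 0.146,
      "Log(SECCHI)": 0.232, "Log(CHLa)": 0.010}


def rank_by_magnitude(coefficients):
    """Variable names ordered from largest to smallest absolute coefficient."""
    return sorted(coefficients,
                  key=lambda name: abs(coefficients[name]),
                  reverse=True)


print("V1:", rank_by_magnitude(v1))
print("W1:", rank_by_magnitude(w1))
```

The sign is ignored in the ranking because it indicates only the direction of a variable's contribution, not its relative weight.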

346.  Table  35  presents  correlations  between  the  original  seven 
variables  and  each  of  the  canonical  variables.  Note  that,  except  for 
sign  differences,  the  correlations  in  Table  35  often  (but  not  always) 
convey the same information as do the standardized canonical coefficients
in Table 34. Their relationship is somewhat analogous to the
relationship between simple correlation coefficients and multiple
regression coefficients.


Table 35

Correlations Between the Original Variables and Their
Canonical Variables

             Canonical Variable
Variable     V1       V2       V3

pH           0.616    0.758   -0.215
Log(ALK)     0.934    0.356   -0.039
Log(COND)    0.461    0.227    0.858

             Canonical Variable
             W1       W2       W3

347.  Another  interesting  statistic  is  the  proportion  of  vari¬ 
ance  (pv)  in  the  original  variables  that  is  explained  by  the  corre¬ 
sponding  canonical  variate.  This  is  easily  calculated  by  summing  the 
appropriate  squared  correlation  coefficients  from  Table  35  and  dividing 
by  the  number  of  original  variables.  Thus: 

$$pv = \sum_{i=1}^{K} (r_{i,cv})^2 / K$$

where

$r_{i,cv}$ = correlation between original variable i and canonical
variable cv

K = number of original variables

348. Using the correlations in Table 35, the proportion of the
variance in the acidity-salinity variables that is explained by V1 is:

$$pv = [(0.6155)^2 + (0.9335)^2 + (0.4605)^2]/3$$
$$pv = 0.487$$

349. Thus about 49 percent of the variance in the acidity-
salinity variables is explained by the first canonical variate (V1).
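
The same arithmetic can be checked with a short Python function. The function name is illustrative; the three correlations are the four-decimal values used in the calculation above.

```python
def proportion_of_variance(correlations):
    """Mean of the squared correlations between the original variables
    in one set and a canonical variate of that set."""
    return sum(r ** 2 for r in correlations) / len(correlations)


# Correlations of pH, Log(ALK), and Log(COND) with V1.
v1_correlations = [0.6155, 0.9335, 0.4605]
pv_v1 = proportion_of_variance(v1_correlations)

print(round(pv_v1, 3))  # 0.487
```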

350. It is also interesting to determine what proportion of the
variance in one set of original variables is explained by the opposite
canonical variate. This is called "redundancy" (rd), and we can calculate
it from:

$$rd = (pv)(CC)^2$$

where pv is the proportion of variance (determined above) and CC is
the canonical correlation coefficient. For example, the redundancy for
W1 and the acidity-salinity variables is:

$$rd = (0.4897)(0.733)^2$$


351. This means that about 26 percent of the variance in the
acidity-salinity variables is explained by the opposite canonical variable
(W1). Thus, W1 is a relatively poor predictor of the acidity-
salinity variables. The redundancy for the trophic state variables and
canonical variate V1 is 0.353, indicating that V1 is slightly better
as a predictor of "opposite" original variables (than is W1). These
low redundancies are not surprising, confirming our original belief that
there is not a strong relationship between trophic state and
acidity-salinity.
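
A small sketch of the redundancy calculation, using the values quoted in the text for W1 and the acidity-salinity variables (proportion of variance 0.4897 and canonical correlation 0.733). The function name is an illustrative assumption.

```python
def redundancy(pv, canonical_correlation):
    """Proportion of variance in one set of original variables explained
    by the opposite canonical variate: rd = pv * CC**2."""
    return pv * canonical_correlation ** 2


rd_w1 = redundancy(0.4897, 0.733)
print(round(rd_w1, 2))  # 0.26
```

Because the squared canonical correlation can never exceed one, the redundancy is always less than or equal to the proportion of variance explained within the variable's own set.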

352.  Cluster  analysis.  Cluster  analysis  is  a  classification 
method  that  may  be  used  to  group  or  identify  similar  objects.  These 
objects  may  be  reservoirs,  sampling  sites  (or  dates)  within  a  reservoir, 
or  water  quality  variables  (e.g.,  nitrogen,  chlorophyll,  alkalinity, 
etc.)  measured  in  one  or  more  water  bodies.  The  criterion  of  similarity 
may  be  defined  in  several  ways;  in  most  applications,  however,  it  is 
based  on  either  the  correlation  coefficient  or  the  Euclidean  distance 
(which is a function of the sum of squared differences between attributes).
Before discussing clustering criteria, though, let us first
consider  the  types  of  problems  that  might  be  fruitfully  studied  using 
cluster  analysis.  For  additional  information  on  cluster  analysis,  Davis 
(1973)  or  Green  (1978)  may  be  consulted. 

353.  In  summarizing  water  quality  studies  among  reservoirs,  it  is 
often  informative  to  classify  the  water  bodies  according  to  various 
criteria.  Using  cluster  analysis,  the  investigator  could  classify  the 
reservoirs  according  to  their  similarities  on  any  group  of  variables 
desired.  For  example,  trophic  state  classification  would  occur  if  the 
analysis  is  confined  to  accepted  trophic  state  variables.  Or,  reservoir 
similarities  in  general  could  be  identified  when  all  water  quality 
variables  are  included  in  the  cluster  analysis. 

354.  Within  a  single  water  body,  an  ongoing  sampling  program  may 
be  designed  to  gather  data  at  specific  points  in  space  and  time.  Given 
limited  resources  for  sampling,  the  scientist  may  want  to  know  something 
about  the  redundancy  in  the  sampling  program.  In  specific  terms,  he  may 
want to know how much information is lost if one or more sampling
stations or dates are discontinued. When a cluster analysis is performed
on the sampling stations and/or dates (in consideration of serial
or spatial correlation), the analyst obtains a measure of similarity (or
redundancy)  among  stations/dates.  In  conjunction  with  other  statistical 
analyses,  the  cluster  analysis  may  be  used  to  aid  the  decision  on  sam¬ 
pling  effort  reduction. 

355.  We  might  be  interested  in  the  covariation  or  similarity 
among  variables  in  a  data  set  containing  measurements  on  several  vari¬ 
ables.  These  data  could  be  taken  within  one  or  more  water  bodies.  For 
example,  we  might  want  to  know  if  quantitative  information  on  cultural 
activities  in  the  watersheds  of  reservoirs  is  related  to  (covaries  with) 
any  of  the  measured  water  quality  variables.  Alternatively,  we  might 
want  a  single  statistical  "picture"  of  the  similarities  among  the  vari¬ 
ables  that  we  have  measured.  Cluster  analysis,  with  an  accompanying 
dendrogram for display, can be used to do these analyses.

356.  Three  methods  of  clustering  are  available  for  the  investi¬ 
gator  using  the  latest  version  of  SAS.  Since  these  algorithms  are  typi¬ 
cal  of  those  that  are  used  in  other  statistical  computing  packages  as 
well,  we  will  focus  our  discussion  on  these  three  methods. 

357.  Clustering  of  cases  using  SAS  is  accomplished  in  a  hierar¬ 
chical  manner.  At  the  beginning,  all  cases  are  assumed  to  belong  to 
separate  clusters,  and  at  each  step  cases  (or  groups  of  cases)  are 
clustered together according to one of the three clustering criteria.
The three clustering criteria, or methods, are the centroid method,
Ward's method, and average linkage according to squared Euclidean
distances.

358.  Centroid  cluster  analysis  is  based  on  the  distance,  or 
similarity,  between  the  centroid  (or  mean)  of  each  cluster.  (Remember 
that  at  the  start  of  the  cluster  analysis,  each  case  is  considered  a 
separate  cluster.  By  case,  of  course,  we  mean  one  row  in  the  data 
matrix,  which  could  be  one  lake,  or  one  sampling  station,  if  the  study 
involves  a  cross  section  of  lakes  or  of  sampling  stations.)  According 
to this criterion, at each step the cases or clusters separated by the
smallest Euclidean distance are joined together. The Euclidean distance
is defined as:


for the distance between points i and j for the kth variable.
Euclidean distance may be calculated for variables in the original
metric or for standardized variables; standardization is discussed later
in this section. Note that correlation between variables is not
considered in Euclidean distance.
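
For reference, the usual textbook form of the Euclidean distance between two cases is the square root of the sum, over all variables, of the squared differences. The sketch below assumes that standard form (it is not copied from this report), and the function name and data values are hypothetical.

```python
import math


def euclidean_distance(case_i, case_j):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(case_i, case_j)))


# Two hypothetical cases measured on three variables.
case_i = [1.0, 2.0, 3.0]
case_j = [4.0, 6.0, 3.0]

print(euclidean_distance(case_i, case_j))  # 5.0
```

Each variable contributes independently to the sum, which is why correlation between variables plays no part in the distance.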

359. Ward's method is based on the within-clusters sum-of-squared
deviations (from the cluster mean). At each step the union of all pairs
of clusters is considered. For the candidate cluster being considered,
the sum of squared deviations of the cases from the cluster mean is
(calculated here for a single variable):

$$ssd = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

360.  At  each  step,  the  new  cluster  formed  is  the  candidate 
cluster  that  has  the  smallest  sum-of-squared  deviations. 
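
One merge step of this kind can be sketched for a single variable. Note the assumption: the sketch uses the common formulation of Ward's criterion, which joins the pair of clusters whose union gives the smallest increase in the total within-cluster sum of squared deviations. The function names and data values are hypothetical, and this is not the SAS implementation.

```python
from itertools import combinations


def ssd(values):
    """Sum of squared deviations of the cases from their mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def ward_step(clusters):
    """Indices of the pair of clusters whose union adds least to total ssd."""
    def increase(pair):
        a, b = clusters[pair[0]], clusters[pair[1]]
        return ssd(a + b) - ssd(a) - ssd(b)

    return min(combinations(range(len(clusters)), 2), key=increase)


# Hypothetical clusters on one variable (illustrative values only).
clusters = [[1.0], [1.2], [5.0], [5.5, 5.7]]
print(ward_step(clusters))  # (0, 1)
```

Repeating the step, each time replacing the chosen pair with its union, builds the full hierarchy.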

361. In the average linkage or group average method, the distance
between two clusters is defined as the average squared Euclidean distance
between pairs of observations, one from each cluster. At each step, this
distance is determined for every pair of clusters. The two
clusters that are joined together are those with a minimum value for
this distance measure.
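
A minimal sketch of the average linkage distance between two clusters, taking the average over all pairs formed by one case from each cluster. The function names and the data values are hypothetical.

```python
def squared_euclidean(case_i, case_j):
    """Sum of squared differences across variables."""
    return sum((x - y) ** 2 for x, y in zip(case_i, case_j))


def average_linkage(cluster_a, cluster_b):
    """Average squared Euclidean distance over all cross-cluster pairs."""
    total = sum(squared_euclidean(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical clusters of cases measured on two variables.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0]]

print(average_linkage(cluster_a, cluster_b))  # 11.0
```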

362.  Among  the  three  hierarchical  clustering  methods  available 
with  SAS,  Ward's  method  and  the  average  linkage  method  have  been  found 
to  be  two  of  the  best  approaches.  The  centroid  method  might  be  favored 
in  certain  situations  because  it  is  more  robust  (than  the  other  two 
methods) to outliers (since it is based on the cluster mean, and not on
a single case). It is also to be noted that Ward's method has a tendency
to result in clusters with roughly equal numbers of cases, while
the  average  linkage  method  tends  to  yield  clusters  with  approximately 
the  same  variance.  It  is  not  possible  to  unambiguously  rank  the  three 
methods  and  identify  the  best  clustering  criterion  for  all  situations. 
Therefore,  it  is  recommended  that  the  investigator  choose  one  of  the 
three  methods  to  use  on  most  problems  and  become  familiar  (through 
experience)  with  that  method.  For  problems  where  the  final  clustering 
of  cases  has  important  implications,  the  investigator  would  be  wise  to 
apply  all  three  methods  and  select  clusters  on  the  basis  of  the  combined 
results. 

363.  Prior  to  beginning  a  cluster  analysis,  the  investigator  must 
decide  if  the  analysis  is  to  be  undertaken  with  standardized  variables. 
To  standardize  a  data  set,  each  observation  is  replaced  by  the  deviation 
of the observation from the mean, divided by the standard deviation.
Use of standardized variables has the effect of removing the influence
of the units of measurement from the results of the analysis. For
example, using unstandardized variables, the variance for a data set
measured in micrograms per litre will be much higher ($10^6$ times higher) than
the variance for the same data expressed in milligrams per litre. Thus,
if  one  variable  in  a  multivariate  data  set  is  expressed  in  micrograms 
per  litre  and  all  other  variables  are  expressed  in  milligrams  per  litre, 
the  former  variable  is  going  to  excessively  dominate  the  variance- 
covariance  matrix.  This  in  turn  will  affect  the  results  of  all  proce¬ 
dures,  like  cluster  analysis,  that  are  based  on  the  variance-covariance 
matrix.  Alternatively,  if  variables  are  standardized  prior  to  analysis, 
the  results  of  multivariate  analyses  are  effectively  based  on  the  corre¬ 
lation  matrix.  The  standardized  variables  are  unitless,  so  any  linear 
change  in  units  will  not  affect  the  results. 
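
The effect of units on the variance, and its removal by standardization, can be demonstrated with a short sketch. The concentrations are hypothetical values, and the function name `standardize` is illustrative.

```python
from statistics import mean, stdev, variance


def standardize(values):
    """Replace each observation by (observation - mean) / standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


# Hypothetical concentrations in milligrams per litre, and the same
# data expressed in micrograms per litre.
tp_mg_per_litre = [0.010, 0.024, 0.040, 0.069, 0.131]
tp_ug_per_litre = [1000 * x for x in tp_mg_per_litre]

ratio = variance(tp_ug_per_litre) / variance(tp_mg_per_litre)
z_mg = standardize(tp_mg_per_litre)
z_ug = standardize(tp_ug_per_litre)

print("variance ratio:", round(ratio))
print("standardized values agree:",
      all(abs(a - b) < 1e-9 for a, b in zip(z_mg, z_ug)))
```

A thousandfold change in units multiplies the variance by a million, while the standardized values are the same in either unit.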

364.  While  in  most  cases  standardization  is  to  be  recommended, 
there  are  situations  where  standardization  will  adversely  affect  an 
analysis.  For  example,  within-cluster  differences  can  be  reduced  by 
standardization.  In  those  situations,  if  the  within-cluster  variance 
were known, standardization should be based on that term. Thus, it is
recommended that unstandardized variables be used if all variances are
of approximately the same magnitude; otherwise, the variables should be
standardized  prior  to  analysis. 

365. Although some investigators have attempted to use formal
statistical  tests  on  the  results  of  a  cluster  analysis,  this  practice  is 
in  general  not  recommended.  The  major  difficulty  is  that  the  data  would 
be  asked  to  both  form  the  clusters  and  test  their  significance.  Ideally 
one  would  want  the  data  groupings  specified  beforehand;  otherwise,  the 
degrees  of  freedom  for  statistical  testing  are  affected  by  using  the 
data  to  specify  the  hypothesis  (the  clusters). 

366.  If  no  formal  statistical  inference  is  to  be  undertaken  with 
the  results  of  a  cluster  analysis,  then  there  are  no  formal  assumptions 
to  be  invoked  prior  to  the  analysis.  However,  it  is  good  practice  (as 
noted  in  the  introductory  section)  to  work  with  data  that  are  symmetri¬ 
cally  distributed  with  no  obvious  outliers.  Transformations  (selected 
from  inspection  of  univariate  and  bivariate  plots)  to  achieve  this 
symmetry  are  recommended  for  use  as  needed. 

367.  Cluster  analysis  -  example.  This  example,  like  the  previous 
one,  is  based  on  the  data  set  from  Walker  (1981)  representing  water 
chemistry  in  43  Corps  of  Engineers  reservoirs.  Here,  using  the  SAS 
CLUSTER  procedure,  we  group  the  lakes  in  clusters  on  the  basis  of  simi¬ 
larity  in  log  total  phosphorus  concentration,  log  total  nitrogen  con¬ 
centration,  log  Secchi  disk  depth,  and  log  chlorophyll  a  concentration. 
Ward's  method  was  used  for  clustering. 

368.  Table  36  contains  some  of  the  optional  output  from  the  SAS 
procedure. The bimodality statistic indicates the possibility of two
or more distinct clusters in the frequency distribution; a value of
0.555 or greater is evidence of this situation. None of the bimodality
statistics  in  Table  36  is  that  large,  suggesting  that  there  is  more 
likely  a  continuum  of  change  in  the  variables  as  opposed  to  abrupt 
changes. The eigenvalue statistics, particularly the proportion (of
variance "explained" by each eigenvector), indicate the dimensionality
of the data. When a high proportion of the variance is represented by
one eigenvalue, this means that there is substantial correlation among
the variables. For this example, the first eigenvalue represents
75 percent of the variance in the data. This is high, and it suggests,
not surprisingly, that the four variables are correlated. For our
purposes in this cluster analysis, it means that the clusters created
will primarily represent the characteristic that all four variables have
in common, which might be called "trophic state."

Table 36
Cluster Analysis

Variable       Bimodality

Log(TP)        0.421
Log(TN)        0.346
Log(Secchi)    0.382
Log(CHL a)     0.382

Eigenvalues of the Correlation Matrix

               Proportion     Cumulative Proportion
Eigenvalue     of Variance    of Variance

3.004          0.751          0.751
0.456          0.114          0.865
0.422          0.106          0.971
0.118          0.029          1.000

Number of Clusters    Cubic Clustering Criterion

 1                     0.000
 2                    -0.468
 3                    -1.548
 4                    -1.420
 5                    -1.710
 6                    -1.523
 7                    -1.500
 8                    -1.457
 9                    -1.574
10                    -1.577
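
The proportion and cumulative proportion of variance follow directly from the eigenvalues of the correlation matrix: each eigenvalue is divided by their sum, which equals the number of variables (four here). A short sketch using the eigenvalues in Table 36 (the variable names are illustrative):

```python
from itertools import accumulate

# Eigenvalues of the correlation matrix (Table 36).
eigenvalues = [3.004, 0.456, 0.422, 0.118]

total = sum(eigenvalues)  # equals the number of variables
proportions = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportions))

for ev, p, c in zip(eigenvalues, proportions, cumulative):
    print(f"{ev:6.3f}  {p:6.4f}  {c:6.4f}")
```

The computed values agree with the table to within rounding; the first eigenvalue accounts for about 75 percent of the variance.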

369.  The  last  section  of  Table  36  presents  the  cubic  clustering 
criterion  (CCC).  This  is  probably  the  best  available  indicator  of  the 
number  of  clusters  that  represent  the  groupings  within  the  data  set.  A 
local  peak  value  of  CCC  identifies  a  "number  of  clusters"  that  may 
define  an  acceptable  grouping  of  the  data.  For  our  example,  notice  that 
there  are  local  peaks  for  2,  4,  and  8  clusters.  On  the  basis  of  CCC, 
these  could  be  the  final  candidates  for  appropriate  data  groups  on  the 
basis  of  the  four  variables. 

370.  To  aid  in  the  selection  of  clusters  from  among  the 
identified  final  candidates,  a  tree  diagram  should  be  developed.  The 
tree  diagram,  or  dendrogram,  is  shown  in  Figure  23  for  our  example. 
Similar  observations  or  clusters  (as  defined  by  CCC  in  this  case)  are 
joined  first  (near  the  bottom).  The  higher  the  point  of  joining,  the 
less  similar  are  the  members  of  a  cluster.  SAS  output  not  shown  in  this 
manual  presents  various  similarity  or  distance  measures  that  are  recom¬ 
puted  each  time  a  new  cluster  is  formed  in  this  hierarchical  procedure. 

371.  Visual  inspection  of  Figure  23  clearly  shows  the  two-  and 
four-cluster  options,  but  only  through  scrutiny  is  the  eight-cluster 
option  identified.  Further,  since  the  clustering  variables  all 
represent  trophic  state,  we  might  select  the  four-cluster  option  as  best 
representing  these  data  (producing  four  trophic  states).  To  see  this, 
we  look  at  the  range  of  values  in  the  four  clusters  for  some  of  the 
trophic  state  variables.  Numbering  the  clusters  1  through  4  on  the 
basis of left-to-right position on Figure 23:

a. Total phosphorus concentration (mg/m^3)

   Cluster 1: 10.2-30.7
   Cluster 2: 24.2-72.9
   Cluster 3: 40.4-131.0
   Cluster 4: 69.3-277.0

b. Chlorophyll a concentration (mg/m^3)

   Cluster 1: 2.4-6.2
   Cluster 2: 2.6-10.0
   Cluster 3: 4.4-27.8
   Cluster 4: 9.5-67.1

372. While there is some overlap among the clusters, it should be
apparent from the phosphorus and chlorophyll levels that we have succeeded
in grouping the lakes into four trophic states. It should be
realized that the overlap is less serious when all variables are considered
simultaneously (e.g., some lakes with low chlorophyll a also
have low Secchi disk depths due to nonalgal turbidity); this, of course,
is a reason for use of a multivariate (as opposed to univariate)
analysis.

373.  The  user  of  SAS  CLUSTER  will  note  that  several  optional  mea¬ 
sures  of  similarity  or  distance  may  be  computed.  These  statistics  are 
beyond  the  scope  of  this  manual;  however,  information  on  these  measures 
may  be  obtained  either  from  the  references  on  multivariate  analysis 
identified  above,  from  Everitt  (1974),  or  from  Hand  (1981). 

374. Examination of structure: principal components and factor
analysis. Principal components analysis (PCA) and factor analysis (FA)
are used to create a relatively small number of new variables (called
"factors") from a larger number of original variables. With PCA, these
factors are estimated as simple linear functions of the original
variables; each factor is orthogonal (at right angles in a graphical
sense) to all other factors. The practical use of these factors may be
based on the belief that the observed variables in fact represent only a
small number of underlying characteristics, with the relationships or
commonality among the observed variables expressed in the covariance or
correlation matrix. Thus, by using the observed variable correlations
to create a few common factors, the investigator may, for example,
increase his understanding of the underlying structural relationships
among reservoir water quality variables.

375. One problem that PCA is particularly well suited to address is
the estimation of a trophic state index. It is well accepted that
trophic state (a subjective concept) is a function of nutrient
concentrations, biomass levels, water clarity, etc. Since there is no
universally accepted, objective way to create a trophic state index from
these variables, it may be reasonable to use a mathematical procedure
(PCA) to extract the common element from these variables and call this
common element a measure of trophic state. PCA can thus be used to
define a linear function of the trophic state-related original
variables. This function will be the linear function that "explains"
the maximum amount (i.e., more than any other linear function) of the
variance contained in the original data set. Since this "first
principal component" describes (through the linear function) the common
element in the trophic state data, it is reasonable to assume that the
first principal component is a good trophic state index. Further, since
the first principal component maximizes the explained variance, it is
the best trophic state index defined in this manner.

376. Another advantage of the use of PCA to define a trophic state
index is that the first principal component is also the linear function
of the original variables that creates a maximum spread among the
observations. Thus, if the investigator wants to distinguish reservoirs
on a trophic state basis, the first principal component is the linear
function that separates the cases (e.g., reservoirs) as much as
possible. This facilitates the ranking of reservoirs according to
trophic state.

377. PCA and FA can also be used to define the number and type of
underlying factors within a data set. For example, a large
cross-sectional reservoir water quality data set may contain data on
phosphorus, nitrogen, chlorophyll, Secchi depth, alkalinity, pH,
conductivity, calcium, magnesium, sodium, aluminum, chloride, sulfate,
and silicon. It is likely that this multivariable, or multidimensional,
data set actually reflects a much smaller number of true structural
factors or dimensions. For example, the trophic state variables may all
reflect one underlying dimension: trophic state. Correspondingly, many
of the cations and anions may covary (have high bivariate correlation
coefficients), and thus represent another underlying dimension. These
two dimensions, or factors, could be estimated as linear functions of
the original variables using PCA and FA.

378. It may be of interest to the investigator to compare data sets
from two groups of reservoirs (e.g., those in the southeastern United
States versus those in the Southwest) to see if the underlying
structural relationships among variables are different. This could be
done using PCA and FA. For each separate data set, the factors could be
estimated. Then the comparison between the two regions would proceed
through a comparison of the functional forms of the factors.
Specifically, the investigator asks: are the individual factors (from
each of the separate data sets) composed of the same combinations of
variables with approximately the same weights (coefficients)? This
question could be answered in an informal manner through simple
inspection of the factors, or it could be answered with a formal
hypothesis test.

379. Even though PCA and FA can be used to formally test hypotheses,
most applications of these methods are exploratory. In fact,
confirmatory FA involves enough additional complexity that it will be
ignored in this presentation. Thus, it is assumed here that all
applications of PCA and FA are exploratory.

380. While the applications of PCA and FA considered herein are
strictly exploratory, it is still a good idea to work with data
distributions that are "well behaved." This means that, if possible,
data distributions should be approximately symmetric without outlying or
influential points. Ideally, data for a single variable should be
approximately normal, and data for any pair of variables should be
bivariate normal. While this is not strictly required, approximate
normality will lead, in the long run, to inferences that correctly
represent the data. Of course, transformation should be considered if
it is determined that it will result in a more desirable distribution of
data.

381. PCA and FA may be conducted from either the correlation matrix or
the covariance matrix. As noted above when this option was discussed
for other procedures, there are problems of scale with the covariance
matrix. Specifically, the magnitude of the covariance will change as
the units of measurement change. Thus, unless the investigator is
working with variables that all have approximately the same magnitude,
it is recommended that PCA and FA be based on the correlation matrix.
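
The scale problem can be illustrated with a minimal sketch. The five data points below are hypothetical values chosen only for illustration (they are not from the Walker data set), and the helper names are likewise assumptions. Expressing phosphorus in g/m³ rather than mg/m³ changes the covariance by a factor of 1,000 but leaves the correlation unchanged.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements for five reservoirs (illustration only).
tp_mg_m3 = [12.0, 25.0, 40.0, 85.0, 150.0]   # total phosphorus, mg/m^3
secchi_m = [4.1, 2.8, 2.0, 1.1, 0.6]         # Secchi disk depth, m

# The same phosphorus data expressed in g/m^3.
tp_g_m3 = [v / 1000.0 for v in tp_mg_m3]

cov_mg = covariance(tp_mg_m3, secchi_m)
cov_g = covariance(tp_g_m3, secchi_m)
r_mg = correlation(tp_mg_m3, secchi_m)
r_g = correlation(tp_g_m3, secchi_m)

print("covariance (mg/m^3):", cov_mg)
print("covariance (g/m^3): ", cov_g)
print("correlation, either unit:", r_mg, r_g)
```

Because the correlation is unit-free, an analysis based on the correlation matrix gives each variable equal weight regardless of how it was measured.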

382. The analysis usually begins with PCA. Principal components
analysis is used to reexpress the original variables in a set of
orthogonal factors. Each factor is a linear function of one or more of
the original variables. PCA is a mathematical procedure; it is used to
maximize the variance in the original data explained by each factor.
Thus, the first principal component is the linear function that explains
the maximum variance in the original data. The second principal
component is orthogonal to the first component, while explaining the
maximum of the remaining variance unexplained by the first component.
This continues until all the variance in the original data is explained
by the orthogonal components.
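
The successive extraction of components can be sketched as follows. This is only an illustration of the idea, not the algorithm used by SAS: it assumes a small hypothetical correlation matrix, finds each component by power iteration, and then removes (deflates) the variance that component explains before extracting the next one. All function names are assumptions made for this sketch.

```python
import math

def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]

def leading_eigenpair(a, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a
    symmetric positive semidefinite matrix."""
    n = len(a)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = mat_vec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(a, v)))
    return eigenvalue, v

def principal_components(corr):
    """Extract components one at a time, largest explained variance first."""
    a = [row[:] for row in corr]
    components = []
    for _ in range(len(corr)):
        eigenvalue, v = leading_eigenpair(a)
        components.append((eigenvalue, v))
        # Deflate: remove the variance explained by this component so the
        # next component is orthogonal to it.
        a = [[a[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    return components

# Hypothetical correlation matrix for three variables (illustration only).
corr = [[1.0, 0.8, 0.3],
        [0.8, 1.0, 0.2],
        [0.3, 0.2, 1.0]]

components = principal_components(corr)
for k, (eigenvalue, weights) in enumerate(components, start=1):
    print(f"component {k}: variance explained = {eigenvalue / len(corr):.3f}")
```

For a correlation matrix the eigenvalues sum to the number of variables, so dividing each eigenvalue by that number gives the proportion of variance explained.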

383. Principal components analysis is usually conducted to reduce the
dimensionality in a data set, or in other words to reexpress the
information contained in several variables in a smaller number of
factors or components. Thus, it is common to retain only a few factors,
as PCA is effective only if much of the original variance is explained
by relatively few factors. In addition, PCA is generally effective only
if each component has a substantive (i.e., limnological)
interpretation. In one of the examples mentioned above, it was desired
that one of the components have a trophic state interpretation and the
other component represent cations and anions. Sometimes an obvious
substantive interpretation of PCA occurs; when it does not, however,
factor analysis can be used to redefine the factors slightly so that
interpretation is enhanced.

384. The interpretation of the factors is based on the composition of
the factors. The factors are linear functions of one or more of the
original variables; thus, the relative contribution of these original
variables to each factor is the basis for interpretation. This
contribution is measured by the coefficient, or weight, for each
variable. In the example discussed above, if the first principal
component is composed of most of the original variables, but only those
related to trophic state have high coefficients, it is reasonable to
give this component a trophic state interpretation. A cation-anion
interpretation for the second component would be appropriate if the
second component was weighted heavily on the original cation-anion
variables.

385. It is possible that PCA will not result in easily interpretable
components, particularly since PCA is an optimizing routine that is
ignorant of any need for interpretation. Factor analysis may then be
used as it allows the investigator to reorient the components (now
called factors) so that interpretation is facilitated. In a graphical
sense, the components form orthogonal axes (axes at right angles to each
other). Factor analysis involves rotation of these axes, and this
changes the relative weight each of the original variables has for a
particular factor. Thus, factor analysis is performed to create a new
set of factors (axes) for which a logical grouping of original variables
has the highest weights. This will then allow a substantive
interpretation for the new factors.
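
The effect of rotating the axes can be seen in a small two-factor sketch. The loadings and the rotation angle below are hypothetical and chosen by hand for illustration; a procedure such as varimax would instead select the angle by optimizing a criterion. The point shown is that an orthogonal rotation redistributes each variable's weight between the factors while leaving the total variance explained for that variable (the sum of its squared loadings) unchanged.

```python
import math

def rotate(loadings, angle_deg):
    """Apply an orthogonal (rigid) rotation to two-factor loadings."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return {name: (a * c + b * s, -a * s + b * c)
            for name, (a, b) in loadings.items()}

def sum_of_squares(pair):
    return pair[0] ** 2 + pair[1] ** 2

# Hypothetical unrotated loadings: every variable loads on both factors,
# which makes the factors hard to interpret.
unrotated = {
    "TP":   (0.70, -0.60),
    "CHLa": (0.75, -0.50),
    "pH":   (0.65,  0.65),
    "ALK":  (0.70,  0.60),
}

# Hand-picked angle for this illustration only.
rotated = rotate(unrotated, -45.0)

for name in unrotated:
    a, b = rotated[name]
    print(f"{name:5s} factor 1 = {a:6.3f}  factor 2 = {b:6.3f}")
```

After rotation the hypothetical nutrient and chlorophyll variables load almost entirely on the first factor and pH and alkalinity on the second, which is the kind of grouping that supports a substantive interpretation.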

386. The rotation or creation of new axes in factor analysis does not
have to involve strictly orthogonal axes. Oblique, or nonorthogonal,
rotation can be undertaken; this results in factors that are correlated
(orthogonal PCA and FA result in uncorrelated factors). Oblique
rotation is considered if it is believed that the underlying structure
involves factors that are correlated. The assumption, of course, in
orthogonal PCA and FA is that the underlying structure involves strictly
uncorrelated factors.

387. If the interpretation resulting from PCA is unsatisfactory, and
axis rotation using FA seems necessary, then it is recommended that the
investigator try several of the rotation algorithms available through
programs like SAS. Experience with factor analysis rotation for a
particular type of data set is clearly an asset. The example presented
below may help to explain concepts that are still confusing.

388. Principal components and factor analysis - example. The SAS
procedure, FACTOR, was used on the seven-variable Walker data set. One
of the attractive features of this program is that the SAS documentation
is relatively good; this is fortunate because the great number of
available options means, in effect, that many of them cannot be dealt
with in this example. The procedures that are illustrated below,
however, are the most commonly used methods.

389. In this example, we begin with PCA and then apply varimax
rotation. Principal components analysis is the most commonly used
method for creation of the initial orthogonal components or factors from
the original variables. On occasion, one of the factor extraction
procedures available in SAS, such as principal factors (PF) extraction,
might be used. In brief, PCA works with all of the variance in the
original data, both common variance and unique variance. PF, on the
other hand, works only with common variance. Thus, to obtain a general
summary of the data, PCA is the preferred choice. Tabachnick and Fidell
(1983) provide a clear explanation of the differences between the
initial component and factor extraction options.

390. All seven variables in the Walker data set are used in this
analysis; as the variable labels in Table 38 indicate, all but pH are
log-transformed. The variables are: pH, conductivity, alkalinity,
phosphorus concentration, nitrogen concentration, Secchi disk depth, and
chlorophyll a.

391. Table 37 contains a summary of the PCA. For PCA, the communality
referred to in Table 37 is always one; for PF, the communality will lie
between zero and one, and represents the common variance among
variables. The eigenvalues in Table 37 indicate the amount of variance
in the original seven variables that is explained, or represented, by
each of the orthogonal (perpendicular) components. For this example, we
see that the first principal component explains 59.82 percent of the
variance, and the first three components explain 86.99 percent of the
variance. This suggests that the variability in the original seven
variables might be reasonably summarized in perhaps three orthogonal
components.


Table 37

                                      Factor
                   1       2       3       4       5       6       7
Eigenvalue       4.188   1.098   0.804   0.405   0.318   0.136   0.052
Proportion
of variance      0.598   0.157   0.115   0.058   0.045   0.019   0.007
Cumulative
proportion
of variance      0.598   0.755   0.870   0.928   0.973   0.993   1.000
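
The proportions in Table 37 follow directly from the eigenvalues. As a minimal sketch (assuming, as recommended above, that the analysis was based on the correlation matrix, so that the total variance equals the number of variables), each eigenvalue is divided by seven and the results are accumulated:

```python
# Eigenvalues as listed in Table 37.
eigenvalues = [4.188, 1.098, 0.804, 0.405, 0.318, 0.136, 0.052]

# With a correlation matrix each standardized variable has variance 1,
# so the total variance equals the number of variables.
total_variance = len(eigenvalues)

proportions = [ev / total_variance for ev in eigenvalues]

cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for k, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), 1):
    print(f"factor {k}: eigenvalue {ev:5.3f}  proportion {p:5.3f}  cumulative {c:5.3f}")
```

Small differences from the percentages quoted in the text arise because the tabulated eigenvalues are rounded to three decimals.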

392. The factor pattern matrix in Table 38 is a matrix of correlation
coefficients between the original seven variables and the seven
factors. For this example, note that the correlations between factor 1
and six of the seven variables are relatively high. This means that the
first factor (the first principal component) is a good summary
descriptor of essentially all of the variables (particularly phosphorus,
alkalinity, and chlorophyll). If it were our objective to create a
single (linear) index value for each reservoir summarizing the
measurements on these variables within that reservoir, then the first
factor in Table 38 is a good choice. We should then produce a matrix of
standardized factor scoring coefficients. These are analogous to
standardized regression coefficients. For each reservoir, to calculate
the score on factor 1, the scoring coefficient for each variable is
multiplied by the standardized (subtract the mean and divide by the
standard deviation) value of that variable; these terms are then
summed. Thus, for a particular reservoir:

$$ PCA_1 = \sum_i b_i z_i $$

where

$PCA_1$ = calculated value for the first principal component
$b_i$ = factor scoring coefficient for variable i
$z_i$ = standardized value for variable i

Table 38

Principal Components and Factor Analysis: Factor Pattern Matrix

                                      Factor
Variable          1       2       3       4       5       6       7
pH              0.693   0.675  -0.136   0.048  -0.110  -0.133   0.119
Log(COND)       0.553   0.017   0.815  -0.125   0.111  -0.015   0.033
Log(ALK)        0.859   0.411   0.007  -0.146  -0.195   0.115  -0.144
Log(TP)         0.878  -0.255  -0.242  -0.199   0.084   0.220   0.104
Log(TN)         0.741  -0.430   0.127   0.407  -0.290   0.019   0.015
Log(SECCHI)    -0.790   0.464   0.173   0.273   0.026   0.236   0.036
Log(CHLa)       0.850   0.093  -0.127   0.293   0.404  -0.023  -0.060
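
Because the entries of the factor pattern matrix are factor-variable correlations, two useful summaries can be computed from Table 38 by hand. The sketch below (the function and variable names are assumptions made for this illustration) sums the squared loadings down each column, which should approximately reproduce the eigenvalues in Table 37, and sums them across each row for the first three factors, which gives the share of each variable's variance that those three factors account for.

```python
# Factor pattern matrix transcribed from Table 38 (factors 1 through 7).
loadings = {
    "pH":          [0.693,  0.675, -0.136,  0.048, -0.110, -0.133,  0.119],
    "Log(COND)":   [0.553,  0.017,  0.815, -0.125,  0.111, -0.015,  0.033],
    "Log(ALK)":    [0.859,  0.411,  0.007, -0.146, -0.195,  0.115, -0.144],
    "Log(TP)":     [0.878, -0.255, -0.242, -0.199,  0.084,  0.220,  0.104],
    "Log(TN)":     [0.741, -0.430,  0.127,  0.407, -0.290,  0.019,  0.015],
    "Log(SECCHI)": [-0.790, 0.464,  0.173,  0.273,  0.026,  0.236,  0.036],
    "Log(CHLa)":   [0.850,  0.093, -0.127,  0.293,  0.404, -0.023, -0.060],
}

def variance_explained(row, n_factors):
    """Sum of squared loadings of one variable on the first n_factors."""
    return sum(x * x for x in row[:n_factors])

# Column sums of squared loadings approximate the eigenvalues.
column_sums = [sum(row[k] ** 2 for row in loadings.values()) for k in range(7)]

# Row sums over all seven factors should be close to 1 for PCA.
full = {name: variance_explained(row, 7) for name, row in loadings.items()}

# Row sums over the first three factors only.
three = {name: variance_explained(row, 3) for name, row in loadings.items()}

print("column sums:", [round(v, 3) for v in column_sums])
for name in loadings:
    print(f"{name:12s} all factors: {full[name]:.3f}  first three: {three[name]:.3f}")
```

Variables with a low three-factor value are those least well represented when only three factors are retained.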

393. For this example, however, our objective is to create components
or factors that can be given a water quality interpretation. With that
concern in mind, factor 1 is less attractive since it seems to represent
all of the variables. On the other hand, the factor-variable
correlations in Table 38 suggest that factor 2 is primarily an indicator
of pH, and factor 3 is primarily an indicator of conductivity. Further,
factors 4 through 7 seem relatively unimportant on the basis of both the
factor-variable correlations and the eigenvalues. Thus, we request a
factor rotation retaining only the first three factors.

394. As noted above, factor rotation is often used to reorient the
factors with respect to the original variables so that substantive
interpretation is facilitated. Varimax (orthogonal) rotation is the
most commonly used procedure, so it is used here. In essence, the
varimax procedure increases the effect of a variable on a factor for
those variables that are highly correlated with the initial components,
and decreases the effect for those variables that are not highly
correlated with the initial components.

395. Table 39 contains the results of the varimax rotation. The
orthogonal transformation matrix in Table 39 converts the first three
principal components into the three new orthogonal factors created using
varimax. The rotated factor pattern yields the new matrix of
factor-variable correlations; the correlations are multiplied by 100,
and the highest values are marked with an asterisk for ease of
interpretation. The communalities (sums of squared factor-variable
correlations) indicate how much of the variance of each of the seven
variables is accounted for by the three retained factors. Finally,
standardized scoring coefficients are presented in Table 40, permitting
calculation of factor scores using the standardized variables as shown
above.

Table 39

Principal Components and Factor Analysis: Varimax Rotation

Orthogonal Transformation Matrix

Factor          1
1             0.721
2            -0.646
3            -0.250

Rotated Factor Pattern Matrix

                    Factor
Variable          1        2
pH               10       97 *
Log(COND)        18       20
Log(ALK)         35       84 *
Log(TP)          86 *     39
Log(TN)          78 *     10
Log(SECCHI)     -90 *    -16
Log(CHLa)        58 *     62 *

Communality Estimates

[Columns 2 and 3 of the transformation matrix, the factor 3 column of
the rotated factor pattern matrix, and the communality estimates are not
legible in the source.]

396. The results in Table 39 indicate that we have achieved our
objective of interpretable factors. Based on the rotated factor pattern
(the factor-variable correlations) in Table 39, the first factor is
effectively a trophic state index, describing primarily the trophic
state variables phosphorus, nitrogen, Secchi disk depth and, to a lesser
extent, chlorophyll a. The second factor is an indicator of acidity, as
it is most highly correlated with pH and alkalinity (and to a lesser
extent, chlorophyll a). The third factor is an indicator of dissolved
solids or salinity, as it is largely a function of conductivity. Thus,
we have created three orthogonal composites of the original seven
variables, and each of the three factors has a clear substantive
meaning. The standardized scoring coefficients could then be used in
conjunction with standardized variables to calculate values for these
trophic state, acidity, and salinity factors for each reservoir.

397. Predicting group membership: discriminant analysis. Discriminant
analysis is used to define a linear function of predictor variables that
may be employed to predict group membership for a particular case (e.g.,
lake). The dependent variable in discriminant analysis is categorical,
as it identifies group membership. Conceptually, it is helpful to think
of discriminant analysis as somewhat analogous to regression analysis;
both procedures are often employed to define a linear relationship that
may be used to predict the value of a dependent variable. In
discriminant analysis, this dependent variable is categorical (e.g.,
trophic state), while in regression analysis the dependent variable is
frequently continuous (e.g., phosphorus concentration).


Table 40

Principal Components and Factor Analysis:
Standardized Scoring Coefficients

                          Factor
Variable           1         2         3
pH              -0.235     0.604    -0.118
Log(COND)       -0.168    -0.101     1.004
Log(ALK)        -0.096     0.411     0.067
Log(TP)          0.377     0.010    -0.216
Log(TN)          0.341    -0.220     0.212
Log(SECCHI)     -0.463     0.165     0.138
Log(CHLa)        0.131     0.220    -0.088
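
The scoring coefficients in Table 40 can be applied as in the following sketch. The coefficients are transcribed from the table; the standardized values for the example reservoir are hypothetical, and the function names are assumptions made for illustration. Each factor score is the sum, over the seven variables, of the scoring coefficient times the standardized value.

```python
# Standardized scoring coefficients from Table 40, in the order
# (factor 1: trophic state, factor 2: acidity, factor 3: salinity).
SCORING = {
    "pH":          (-0.235,  0.604, -0.118),
    "Log(COND)":   (-0.168, -0.101,  1.004),
    "Log(ALK)":    (-0.096,  0.411,  0.067),
    "Log(TP)":     ( 0.377,  0.010, -0.216),
    "Log(TN)":     ( 0.341, -0.220,  0.212),
    "Log(SECCHI)": (-0.463,  0.165,  0.138),
    "Log(CHLa)":   ( 0.131,  0.220, -0.088),
}

def standardize(value, mean, sd):
    """Subtract the data set mean and divide by the standard deviation."""
    return (value - mean) / sd

def factor_scores(z):
    """Return the three factor scores for one reservoir, given a dict of
    standardized values keyed by variable name."""
    return tuple(sum(SCORING[name][k] * z[name] for name in SCORING)
                 for k in range(3))

# Hypothetical standardized values for one reservoir (illustration only).
z_example = {
    "pH": 0.4, "Log(COND)": -0.3, "Log(ALK)": 0.2, "Log(TP)": 1.1,
    "Log(TN)": 0.9, "Log(SECCHI)": -1.0, "Log(CHLa)": 0.8,
}

trophic, acidity, salinity = factor_scores(z_example)
print("trophic state score:", trophic)
print("acidity score:      ", acidity)
print("salinity score:     ", salinity)
```

In practice the means and standard deviations passed to `standardize` would be those of the data set from which the factors were estimated.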

398. For example, discriminant analysis can be used to develop a model
for the prediction of trophic state. To do this, the investigator can
use a cross-sectional data set of lakes and reservoirs, containing data
on trophic state predictor variables (nitrogen, phosphorus, water
clarity, Secchi disk depth, etc.) along with the trophic state
classification for each lake or reservoir. Discriminant analysis is
then employed to estimate a linear function of the predictor variables
that best classifies the lakes and reservoirs into their preassigned
classes (trophic states). Once this model is estimated, it may then be
used to predict the trophic state for a new lake or reservoir, on the
basis of estimates of the predictor variables.

399. Of course, trophic state is not the only limnological
classification scheme for which it would be useful to have a predictive
model. As noted in Reckhow and Chapra (1983), it might be of value to
develop a model for the prediction of oxic versus anoxic status in
lakes, or perhaps a model for the prediction of expected dominant algal
type on the basis of nutrient and aquatic chemistry.

400. While the development of a predictive model is the primary
objective in most applications of discriminant analysis, other useful
information is generated from the application of the procedure. Use of
discriminant analysis shows the investigator which variables are most
important in the prediction of group membership. It also indicates how
effectively one can predict group membership by providing an assessment
of the proportion of misclassified cases in a data set. Finally, the
discriminant functions (or classification functions) can be used to
create a new function that provides an estimate of the probability that
a case (e.g., reservoir) belongs in one of the predefined groups. This
probability estimate is often a particularly informative way of
expressing the confidence one might have in the prediction of group
membership. For an excellent treatment of the theory and application of
discriminant analysis, see Tabachnick and Fidell (1983).

401. For inferences and group membership predictions to be
appropriate, some conditions are recommended and some statistical
assumptions are technically required. As with all of the multivariate
statistical methods, outliers can adversely affect the results of
discriminant analysis. Therefore, it is recommended that the
investigator examine histograms and bivariate plots to check for
outlying data points. Transformations should be used if necessary to
reduce the impact of outliers. (See Parts II and III for additional
guidance on the treatment of outliers and influential data points.)

402. The primary statistical assumptions for the application of
discriminant analysis are that the predictor variables are distributed
according to a multivariate normal distribution within each group and
that the variance-covariance matrices are constant across groups. The
normality assumption can often be effectively assessed by checking for
bivariate normality for any pair of predictor variables, while the
variance-covariance matrices can often be compared by eye as these
matrices are routinely calculated in most discriminant analysis
programs. Informal checks on these assumptions are often adequate, as
the results are fairly robust to violations. This is particularly true
if the smallest group contains about 20 cases or more and there are only
a few predictor variables in the discriminant function (Tabachnick and
Fidell 1983). Transformations, of course, can be used if there is
concern over the violation of one of these assumptions.

403. If there are more than two groups and more than two predictor
variables, more than one discriminant function may be estimated. Like
principal components, each discriminant function is orthogonal to all
others, and the first discriminant function is the most effective linear
predictor of group membership (with additional functions being
successively less effective). Generally, only one or a few of the
discriminant functions are retained for the prediction of group
membership. A statistical test can be conducted to assess the ability
of each of the discriminant functions to determine group membership.

404. Discriminant analysis - example. For this example, we look at a
cross-sectional model to be used to predict presence or absence of fish
species in a lake. The data set consists of 32 lakes in the Adirondacks
of New York State. The dependent, or classification, variable is the
observation of presence (1) or absence (0) of brook trout in each lake,
assessed during fish surveys. The predictor variables are pH and
aluminum concentration (log-transformed).

405. Using SAS, PROC DISCRIM was run using the option POOL=TEST; this
tests for equality of the covariance matrices. If the test results are
not significant, the pooled covariance matrix is used in the
calculations. SAS prints the discriminant function coefficients only
when this pooled covariance matrix is used, so this is an important
option to consider. POOL=YES causes SAS to bypass the significance test
and automatically use the pooled covariance matrix.

406. Several optional statistics may be requested from SAS, including
correlation and covariance matrices. For this example, Table 41
presents the within-group correlation matrix. This lists the bivariate
correlation coefficients (and their significance levels) for all pairs
of variables, calculated separately for the data within each of the
predefined groups.

Table 41

Discriminant Analysis: Within Correlations

                   Fish Absent               Fish Present
Variable         pH       Log(AL)          pH       Log(AL)
pH             1.000      -0.901         1.000      -0.514
                         (<0.001)                   (0.035)
Log(AL)       -0.901       1.000        -0.514       1.000
             (<0.001)                   (0.035)

Pairwise squared generalized distance = 1.572

407. Next of interest in the SAS output is the pairwise squared
generalized distance (or Mahalanobis distance) between groups. This is
the distance between group centroids (based on the group means for each
variable) scaled by the within-groups covariance matrix. For this
example, this distance ($D^2(I|J)$) is estimated as 1.572. Since $D^2$
is a measure of separation between the two groups, it is of interest to
test the significance of this difference. The test is based on
conversion of $D^2$ to an F statistic according to (Green 1978):

$$ \frac{m_0 m_1 (m_0 + m_1 - n - 1)}{n (m_0 + m_1)(m_0 + m_1 - 2)} D^2 \sim F(n,\ m_0 + m_1 - n - 1) $$

where

$m_0$ = the number of observations in group 0 (absence)
$m_1$ = the number of observations in group 1 (presence)
$n$ = the number of predictor variables.

For the example (with $m_0 = 15$ and $m_1 = 17$), the F statistic is 6.05.
For 2,29 degrees of freedom, the separation of the groups is significant
at better than the 0.01 level.
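
The conversion of the generalized distance to an F statistic is simple arithmetic, sketched below with the values quoted for this example. The function name is an assumption made for illustration; the formula is the one attributed above to Green (1978).

```python
def d2_to_f(d2, m0, m1, n):
    """Convert a squared generalized distance between two groups to an
    F statistic and its degrees of freedom.

    d2 : squared generalized (Mahalanobis) distance between groups
    m0 : number of observations in group 0
    m1 : number of observations in group 1
    n  : number of predictor variables
    """
    numerator = m0 * m1 * (m0 + m1 - n - 1)
    denominator = n * (m0 + m1) * (m0 + m1 - 2)
    f_stat = numerator / denominator * d2
    return f_stat, (n, m0 + m1 - n - 1)

# Values from the brook trout example: D^2 = 1.572, 15 lakes without
# fish, 17 lakes with fish, and two predictors (pH and log aluminum).
f_stat, dof = d2_to_f(1.572, 15, 17, 2)
print(f"F = {f_stat:.2f} with {dof[0]},{dof[1]} degrees of freedom")
```

The resulting statistic would then be compared with a tabulated F value at the chosen significance level.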

408. Table 42 presents the coefficients for the linear discriminant
functions. They may be written as:

$$ df_0 = -109.082 + 26.277\,\mathrm{pH} + 15.880 \log(\mathrm{Al}) $$
$$ df_1 = -116.772 + 27.885\,\mathrm{pH} + 15.618 \log(\mathrm{Al}) $$


Table 42

Discriminant Analysis: Coefficients for the Linear
Discriminant Function

                 Fish Absent    Fish Present
Constant          -109.082        -116.772
pH                  26.277          27.885
Log(AL)             15.880          15.618


409. These functions are used to classify cases as follows. For each
case, the predictor values are substituted into the equations above and
observation-specific values for $df_0$ and $df_1$ are calculated. A
case is then classified into the group yielding the higher value for
$df$. For example, if lake x has pH = 4.8 and Al = 300 µg/L, then

$$ df_0 = 56.384 $$
$$ df_1 = 55.764 $$

410. According to this criterion, lake x would be classified in
group 0 (fish absence).
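
The classification rule can be written out as a short sketch using the coefficients in Table 42. The function names are assumptions made for illustration, and the logarithm is taken to be base 10, which is an inference: it is the base that reproduces the two values quoted for lake x.

```python
import math

# Linear discriminant function coefficients from Table 42:
# (constant, pH coefficient, log(Al) coefficient) for each group.
COEFFICIENTS = {
    0: (-109.082, 26.277, 15.880),   # group 0: fish absent
    1: (-116.772, 27.885, 15.618),   # group 1: fish present
}

def discriminant_scores(ph, al_ug_per_l):
    """Evaluate both discriminant functions for one lake.

    Assumes aluminum is in micrograms per liter and that the log
    transformation is base 10 (inferred from the worked example).
    """
    log_al = math.log10(al_ug_per_l)
    return {group: const + b_ph * ph + b_al * log_al
            for group, (const, b_ph, b_al) in COEFFICIENTS.items()}

def classify(ph, al_ug_per_l):
    """Assign the lake to the group with the higher discriminant score."""
    scores = discriminant_scores(ph, al_ug_per_l)
    return max(scores, key=scores.get)

scores = discriminant_scores(4.8, 300.0)
print("df0 =", round(scores[0], 3), " df1 =", round(scores[1], 3))
print("classified in group", classify(4.8, 300.0))
```

Only the difference between the two scores matters for classification, so the large common magnitude of the scores has no interpretive meaning.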

411. Table 43 presents a summary of the classification success of the
discriminant analysis model. The rows in the 2 x 2 table identify the
actual group membership for each case, and the columns identify the
predicted group membership (based on the discriminant analysis). A
perfect classifier would have nonzero entries along the
upper-left-to-lower-right diagonal and zeros in all other cells of the
table. For example, in Table 43, 12 and 3 are the entries in the upper
row. This means that 12 of 15 observed group 0 lakes are classified
correctly as group 0 while 3 are classified incorrectly as group 1. A
similar interpretation holds for the second row. Summing along the
diagonal, we see that 12 + 11 = 23 cases are correctly classified and
32 - 23 = 9 cases are incorrectly classified.

Table 43

                        Predicted group
Actual group       Fish absent   Fish present     Total
Fish absent             12             3            15
  (percent)           80.00         20.00        100.00
Fish present             6            11            17
  (percent)           35.29         64.71        100.00
Total                   18            14            32
  (percent)           56.25         43.75        100.00
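
The summary figures for Table 43 can be reproduced with a few lines of arithmetic. The sketch below uses the four cell counts from the table; the variable names are assumptions made for illustration.

```python
# Cell counts from Table 43: counts[actual][predicted].
counts = {
    "absent":  {"absent": 12, "present": 3},
    "present": {"absent": 6,  "present": 11},
}

total = sum(sum(row.values()) for row in counts.values())

# Correct classifications lie on the diagonal (actual == predicted).
correct = sum(counts[group][group] for group in counts)
misclassified = total - correct

# Row percentages: how each actual group was distributed among the
# predicted groups.
row_percent = {
    actual: {pred: 100.0 * n / sum(row.values()) for pred, n in row.items()}
    for actual, row in counts.items()
}

print("correctly classified:", correct, "of", total)
print("misclassified:", misclassified)
for actual, row in row_percent.items():
    print(actual, {pred: round(p, 2) for pred, p in row.items()})
```

The row percentages show that, for these data, lakes without fish are classified correctly more often than lakes with fish.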

412. In Table 44 all cases are classified according to the
discriminant function model using the generalized distance measure in an
exponential expression. The exponential equation has range zero to one,
so it is given a probabilistic interpretation. Thus, for each case, the
probability of membership in each group is calculated, and the case is
classified into the group for which the probability is the highest.
These probabilities and classifications are presented in Table 44.

413. The probability equation in Table 44 is cumbersome because it is
based on the generalized distance of a case from each group mean.
Fortunately, probabilities can be calculated more easily from the
discriminant functions according to:

$$P(0 \mid df) = \frac{q_0 e^{df_0}}{q_0 e^{df_0} + q_1 e^{df_1}}$$

where

$P(0 \mid df)$ = posterior probability of group 0 classification
$q_0$ = prior probability of group 0 classification
$q_1$ = prior probability of group 1 classification

414. For now, assume that the prior probabilities are equal; this means
that we have no a priori belief that a particular case belongs in one
group or the other. In that case, $q_0 = q_1$. For the example presented
above (pH = 4.8; Al = 300 µg/L), the probabilities are:

$$P(0 \mid df) = 0.650$$
$$P(1 \mid df) = 1 - P(0 \mid df) = 0.350$$

415. Thus, there is a 0.65 chance that the lake is properly classified
in group 0 (fish species absent).

416. Since the equation allows for prior probabilities, we could assign
prior probabilities to each group reflecting the relative proportion of
the relevant population that belongs to each of the groups. For this
example, 15 of the 32 observations in the sample are in group 0
(absence) and 17 of the 32 observations are in group 1 (presence). These
proportions can be used as the prior probabilities. Thus,

$$q_0 = 15/32 = 0.469$$
$$q_1 = 17/32 = 0.531$$

417. Using these prior probabilities in the equation above, the new
posterior probabilities are:

$$P(0 \mid df, q) = 0.621$$
$$P(1 \mid df, q) = 1 - P(0 \mid df, q) = 0.379$$

418. Notice that, as should be expected, the higher prior probability
for group 1 resulted in a drop in the posterior probability for group 0,
in comparison to the analysis when the prior probabilities were not
explicitly included. Use of a prior probability can be particularly
helpful when the groups are quite different in size.
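
The posterior probabilities quoted above can be checked with a short
calculation. The sketch below assumes the posterior takes the usual form
q0*exp(df0) / (q0*exp(df0) + q1*exp(df1)); that form is an assumption
here, adopted because it reproduces the 0.650 and 0.621 values given in
the text. The function name and argument names are illustrative only.

```python
import math

def posterior_group0(df0, df1, q0=0.5, q1=0.5):
    """Posterior probability of group 0 from two discriminant scores.

    Assumed form: q0*exp(df0) / (q0*exp(df0) + q1*exp(df1)).
    """
    # Subtract the larger score before exponentiating to avoid overflow;
    # the common factor cancels in the ratio.
    m = max(df0, df1)
    w0 = q0 * math.exp(df0 - m)
    w1 = q1 * math.exp(df1 - m)
    return w0 / (w0 + w1)

# Lake x from the text: df0 = 56.384, df1 = 55.764
p_equal = posterior_group0(56.384, 55.764)                     # equal priors
p_prior = posterior_group0(56.384, 55.764, q0=15/32, q1=17/32)  # sample priors
print(round(p_equal, 3), round(p_prior, 3))
```

Only the difference between the two scores matters, which is why scores
as large as 56 cause no numerical difficulty.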


419. In situations where the group sizes are different, the "naive"
classification criterion (without consideration of predictor variables)
would assign all cases to the group with the largest number of cases.
Thus, if 80 percent of all cases correctly belong in group 1, the naive
classifier would place all cases in group 1 and have an 80-percent
classification success rate. This "maximum chance criterion" (Morrison
1969) is appropriate for evaluating the success of discriminant
functions (beyond that by chance) if one desires to maximize the
proportion of cases correctly classified. Alternatively, if the
objective is to correctly classify cases into both groups, the
appropriate criterion is based on proportional chance. This is
calculated as (Morrison 1969):

$$C_{pro} = (q_0)^2 + (q_1)^2$$

420. For our example, this is

$$C_{pro} = (15/32)^2 + (17/32)^2 = 0.502$$

whereas the maximum chance criterion is

$$C_{max} = 17/32 = 0.531$$

421. Since our objective in this example would probably be to classify
cases correctly in both groups, the proportional chance criterion is
appropriate. According to that criterion, our model does better than
chance if it classifies more than 50 percent of the cases correctly. In
fact, for the model development data set, Table 43 indicates that 23/32
= 0.719, or about 72 percent of the cases, were correctly classified by
the discriminant function model.
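
The comparison between observed classification success and the two
chance criteria can be sketched as follows. The counts are those of
Table 43; the variable names are illustrative, and the group priors are
taken to be the observed group proportions, as in the text.

```python
# Rows: actual group; columns: predicted group (counts from Table 43).
table = [[12, 3],
         [6, 11]]

total = sum(sum(row) for row in table)
correct = sum(table[i][i] for i in range(len(table)))  # diagonal entries
hit_rate = correct / total

# Priors taken as the observed proportion of cases in each actual group.
priors = [sum(row) / total for row in table]
c_max = max(priors)                  # maximum chance criterion
c_pro = sum(q * q for q in priors)   # proportional chance criterion

print(f"{hit_rate:.3f} {c_max:.3f} {c_pro:.3f}")
```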

422. It must be emphasized that the classification success of the model
should actually be evaluated using a data set that is different from the
model development data set. It is to be expected that a model developed
from a particular data set will yield an overly optimistic
classification success rate when evaluated using the model development
data set. Thus, a separate data set should be used to provide an
unbiased estimate of the classification error rate. Alternatively,
various measures of cross validation (Green 1978) may be employed to
estimate classification error.


PART  V:  SAMPLING  PROGRAM  DESIGN 


Introduction 

423. In undertaking a study, an investigator will generally have as his
objective either the estimation of some parameter or the comparison of
several different populations. Since sampling is the only practical
method of carrying out most studies, the researcher is immediately faced
with several problems. The eventual discussion of the whole population
from a sample involves statistical inference. This means that the true
value of the population parameter will never be known, only an
approximation or estimate of that parameter. It is necessary to obtain
these approximations as accurately and as precisely as possible.

424.  Accuracy  implies  that  an  estimate  of  a  parameter  will,  on 
the  average,  be  centered  on  the  true  population  parameter  and  will  not 
be  shifted  up  or  down.  Estimates  that  have  a  consistent  tendency  to 
overestimate  or  underestimate  a  population  parameter  are  inaccurate  and 
are  said  to  be  "biased." 

425.  Precision  is  an  indication  of  the  reliability  of  an  estimate 
and  refers  to  the  variability  between  repeated  measures  of  the  same 
quantity.  All  estimates  of  parameters  will  have  some  variability,  but 
the  lower  the  variability,  the  higher  the  precision. 

426.  The  major  objective  in  sampling  program  design  is  to  obtain 
as  accurate  or  unbiased  an  estimate  as  possible,  and  at  the  same  time 
reduce  or  explain  as  much  of  the  variability  as  possible  in  order  to 
improve  the  precision  of  the  estimates. 

427.  A  major  concern  in  the  design  of  a  sampling  program  deals 
with  the  problem  of  practicality.  Measuring  the  whole  population  is 
impractical.  The  sampling  scheme  should  provide  an  estimate  that  is  as 
accurate  and  precise  as  possible,  even  though  the  sample  may  be  a  very 
small  fraction  of  the  whole.  For  instance,  the  objective  may  be  to 
determine the average phosphorus concentration of a reservoir. The
sample may be only a few litres of the millions of cubic metres of water
in the reservoir. The number of samples would be small, but with proper
placement and an adequate number of samples, a good estimate could still
be  obtained  to  meet  the  study  objectives. 

428. Another concern in sampling design is cost. More samples are
better, but cost increases proportionally with the number of sampling
trips made, number of sites sampled, and the number of different
analyses performed. The increased number of trips produces a more
precise estimate. Unfortunately, precision does not increase in direct
proportion to the number of trips or analyses: the standard error of a
mean decreases only with the square root of the number of independent
samples, so doubling the number of sampling trips does not double the
precision. The sampling program must be designed to achieve the optimal
allocation of the samples. An optimal allocation will be both practical
and cost effective by striking a balance between how many samples are
needed and how many samples are within budget. If funds are severely
limited, the investigator may also be faced with a decision on the
feasibility of the study. The question is simply: "Will the results that
can be obtained with the available funds produce estimates which are
sufficiently precise to meet the study objectives?"
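
The diminishing return on additional samples can be illustrated with
the standard error of a mean, sigma / sqrt(n), which holds for
independent samples. In the sketch below the standard deviation of 10 is
a purely hypothetical value, and the function name is illustrative.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean of n independent samples."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
for n in (10, 20, 40, 80):
    print(n, round(standard_error(sigma, n), 2))
```

Halving the standard error requires four times as many samples, not
twice as many.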

429. The critical element in designing a sampling program is the
understanding of variability in both the samples and the target
population. If a lake had no variability in phosphorus concentration,
one sample from the most convenient site would provide an adequate
measure for the whole lake. However, if variability exists (and it
always does), some statistical inference procedure is required.


Study  Objectives 

430.  In  order  to  ensure  that  the  sampling  scheme  is  adequate  and 
that it will provide the desired information, it is necessary to state
the  study  objectives  clearly.  Sampling  is  facilitated  by  specifying  the 
narrowest  possible  set  of  objectives  which  will  provide  the  desired 
information.  Several  points  that  should  be  specified  in  the  study 
objectives  are  discussed  below. 

431. Target population definition is the first step, since the sample
must be drawn from the target population. A population, in a statistical
sense, can be defined as the set of all possible values of the variable
of interest which might, or do, exist. The target population is a
limited subset of the population and is simply the population about
which statistical inferences are to be made. The limits of the target
population are defined by the objectives of the study. Examples of
target populations include dissolved oxygen concentrations in reservoir
releases under different operating regimes, phytoplankton abundance
before and after some environmental change, or the average phosphorus
concentrations in a group of reservoirs. Individual measurements of the
population of interest are called observations, and the population
parameter being measured is referred to as the variable. This definition
will not only describe what is to be sampled, but where and when. All
information that limits the population to be sampled should be included.

432. The reason for limiting the target population definition is that
much of the variability which exists is not of interest. If dissolved
oxygen concentrations in reservoir releases are to be sampled, the
investigator need not be concerned with dissolved oxygen concentrations
in the rivers and streams entering the reservoir, which would add an
additional source of variability. The first step then in addressing the
problem of variability is to obtain a clearly defined set of objectives,
which in turn includes a definition of the target population.

433.  For  example,  suppose  the  objective  is  to  sample  dissolved 
oxygen.  All  oxygen  everywhere?  No,  obviously  not.  So  the  definition 
of  the  target  population  is  refined,  and  with  each  new  element  of  the 
definition  the  investigator  derives  a  more  homogeneous  (less  variable) 
target  population — oxygen  concentrations  in  reservoirs,  reservoirs  in 
the  United  States,  in  the  southeastern  United  States,  etc.  Eventually 
the  target  population  may  be  reduced  to  a  certain  size  or  particular 
type  of  reservoir.  Sampling  may  even  be  restricted  to  a  particular  body 
of  water,  to  particular  portions  of  that  body  of  water,  and  even  to 
particular  depths.  The  definition  depends  entirely  on  the  objectives  of 
the  study. 

434. When the study objective is to test for differences, it is not
uncommon to have some specific reservoirs that are to be contrasted.
In  this  case,  the  definition  of  the  target  population  is  simplified. 
However,  if  the  target  population  calls  for  the  comparison  of  two  or 
more  types  of  reservoirs  in  a  geographical  area,  the  reservoirs  chosen 
should  represent  a  random  sample  of  all  the  reservoirs  which  fall  into 
each  of  the  categories  of  interest. 

435. The definition of the target population above has only considered
the spatial limits of the study. Temporal limits must also be defined.
Should the oxygen concentration be sampled during the full annual cycle,
several annual cycles, or perhaps during only part of the annual cycle?
Again this depends on the study objectives. Is the lake to be
characterized for one annual cycle (any year), or does the study call
for estimates of variability between years? Perhaps the study calls for
an evaluation of oxygen conditions in the summer, when hypolimnetic
anoxia is expected to occur.

436.  The  definition  of  the  target  population  must  be  broad  enough 
to  satisfy  the  study  objectives,  but  the  more  narrow  the  objectives  can 
be  made  the  less  variability  there  is  to  be  taken  into  consideration. 
Once  the  target  population  and  scope  of  the  study  have  been  defined,  the 
investigator  should  be  able  to  state  which  variable  or  variables  will  be 
measured.  He  should  also  be  able  to  define  the  spatial  and  temporal 
limits  of  the  experiment. 

437.  Problem  identification  requires  a  decision  on  the  nature  of 
the  final  goal.  To  begin  with,  do  the  researchers  wish  to  estimate  a 
population  parameter,  such  as  reservoir  phosphorus  concentration,  or  do 
they  wish  to  test  a  hypothesis?  These  two  procedures  are  not  mutually 
exclusive,  so  the  objectives  may  involve  both  estimation  and  hypothesis 
testing.  However,  one  will  usually  be  chosen  as  a  primary  objective, 
since the sampling allocations may differ for the two goals. For
example, a balanced design is a desirable trait in hypothesis testing while
an  estimation  of  a  parameter  may  involve  sample  allocation  which  is 
unbalanced  but  will  take  into  consideration  differing  variable  levels  in 
different  areas. 

438. The last step in outlining the sampling objectives is to define
exogenous variables to be measured and/or define strata. The purpose of
including the measurement of covariables (quantitative variables) and
categorical variables or strata (qualitative variables) is to reduce
variability and increase precision. For example, primary productivity is
obviously dependent on incident light conditions and the attenuation of
light with depth. Therefore, in a study of primary productivity, light
is an influencing variable which should be taken into consideration even
though its measurement may not be part of the study objectives. This
exogenous variable can be used to explain part of the variability in
primary productivity, and therefore to increase the precision of the
estimates.

439. Stratification, or the use of categorical variables, serves a
similar purpose (reducing variability in the estimates) in a different
manner. In stratification, the area or time frame of sampling is
subdivided into smaller, more homogeneous units which will have a small
amount of variability within the unit. For example, primary productivity
varies with season. If primary productivity over an annual cycle is
examined, there will be a large degree of variability. However, if each
of the seasons is examined individually, there will be much less
variability. Therefore, summer productivity may range from 0.3 to 0.5
g C m^-2 day^-1, and relatively precise estimates can be obtained.
Relatively precise estimates of the winter productivity, which might
range from 0.01 to 0.1 g C m^-2 day^-1, may also be obtained. However,
if seasons are ignored, the result is a single less precise estimate
ranging from 0.01 to 0.5 g C m^-2 day^-1. Obviously, greater precision
results from estimating the mean productivity separately within each of
the strata (seasons) than if strata are ignored. The same is true for
areas that may have different mean values of the target population. When
the areas are combined, the variability will be greater than if the
areas are separated as strata.
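
The effect of stratifying by season can be illustrated numerically. The
values below are hypothetical, chosen only to fall inside the summer and
winter ranges quoted in the text; they are not data from any study.

```python
from statistics import mean, variance

# Hypothetical productivity values within the quoted seasonal ranges.
summer = [0.30, 0.35, 0.40, 0.45, 0.50]
winter = [0.01, 0.03, 0.05, 0.08, 0.10]
pooled = summer + winter  # seasons ignored

for name, values in (("summer", summer), ("winter", winter), ("pooled", pooled)):
    # Sample variance within each grouping of the observations.
    print(f"{name}: mean={mean(values):.3f} variance={variance(values):.5f}")
```

The pooled variance is much larger than either within-season variance
because it also contains the difference between the seasonal means.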

440. Any consideration in the sampling program which serves to reduce,
eliminate, or explain variability is desirable. There will always be
some variability which cannot be explained in terms of exogenous
variables. This variability is called error. "Error" is not used to
indicate a mistake, but rather it refers to variability in the data that
is not accounted for by the sample design. Error can arise from the
effect of an unknown (and unmeasured) exogenous variable or from the
variability introduced in the measurement of a variable. This source of
unexplained variability is a natural property of the target population,
and the investigator should not hope to eliminate it, only to reduce the
unexplained variability as much as possible.

441. A careful definition of the objectives is a critical step in
conducting any study. Objectives that are too broad and poorly defined
generally result in an inefficient sampling program which may not
adequately meet the objectives of the study. The need for this simple,
but often ignored, step of establishing study objectives was described
above.

442. At this point the investigator has fully defined his variables of
interest, both his target population and exogenous variables, and the
scope of the study. This clearly defines the sampling objectives and
facilitates the design of the sample placement.


Sample Allocation

443. Once the objectives are clearly stated, sample placement
follows  fairly  easily.  Assume  for  the  moment  that  the  investigator  must 
decide  how  to  allocate  a  given  number  of  samples,  the  number  of  which 
will  be  referred  to  as  "n."  A  discussion  of  how  large  n  must  be  will 
follow.  The  discussions  of  statistical  analyses  which  follow  will  make 
one  common  assumption,  that  the  sampling  has  been  random.  If  the 
objective  is  to  sample  the  oxygen  concentrations  in  a  particular  cove  of 
a  reservoir,  the  analysis  will  assume  that  every  drop  of  water  in  that 
cove  had  an  equal  chance  of  being  sampled.  This  condition  is  met  when 
the  selection  of  a  particular  site  does  not  affect  the  choice  of  the 
second  and  subsequent  sites.  Other  types  of  sampling  will  be  discussed 
later,  but  for  the  moment,  random  sampling  will  be  assumed. 

444.  If  sampling  is  to  be  random,  the  individual  sites  and  depths 
for each sample are selected at random, usually from a random number
table or random number generator. If the sample is to be completely
random,  sampling  dates  may  even  be  selected  at  random  across  the  annual 
cycle.  However,  this  may  not  be  the  best  sampling  scheme,  and  it  may 
not  address  the  objectives  well. 
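
Random selection of sampling locations with a random number generator
can be sketched as follows. The site labels, depth grid, and sample size
are hypothetical, and the sketch assumes the candidate site and depth
combinations can be enumerated in advance.

```python
import random

rng = random.Random(42)  # fixed seed so the selection can be reproduced

# Hypothetical sampling frame: 20 sites, depths every 2 m from 0 to 28 m.
sites = [f"site_{k}" for k in range(1, 21)]
depths_m = list(range(0, 30, 2))
candidates = [(site, depth) for site in sites for depth in depths_m]

n = 8  # number of samples to place
# Draw n distinct (site, depth) pairs, each equally likely to be chosen.
sample = rng.sample(candidates, n)
for site, depth in sample:
    print(site, depth)
```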

Identification  of  strata 

445. Strata are subdivisions of a larger population that are used to
reduce variability by working with smaller, more homogeneous units of
the total target population. Strata are generally used where parameters
are to be estimated. Each stratum is treated as a separate entity.
Estimates of the population of interest are obtained for each stratum
separately, and combined for a final estimate. The advantage is that the
variances can be estimated separately, and can be added. Each of the
variances for the strata is likely to be smaller than if the population
had not been stratified.

446. After the strata to be sampled have been identified, such that each
stratum is more homogeneous than the whole, the next step is to allocate
a number of samples to each stratum. This is called stratified random
sampling. For instance, suppose a decision has been made to stratify the
samples into four seasons and into three depths, with the expectation of
relatively little variability within each stratum as compared to the
variability which exists between the strata. Then if there were n = 600
samples to be allocated, one option is to sample each of the 12 strata
(three depths in each of four seasons) equally. This calls for the
placement of 50 samples (600/12) in each of the strata.

447. Sampling schemes other than equal allocation are possible. For
instance, if the volume of water at one depth is only 20 percent of the
total while the other depths contain 40 percent each, allocation of the
number of samples to each depth may be in proportion to the volume of
that depth. This is called proportional allocation. This type of
allocation can also consider the variability of each stratum, in
addition to its size. If one stratum is more variable than another, it
can receive more sampling effort in proportion to its variability. This
yields a greater precision for the more variable strata, and a more
precise estimate overall. One additional consideration is cost. A sample
allocation capable of considering cost, in addition to the factors
mentioned above, is called optimal allocation. This method considers not
only the size and variability of a stratum, but also the cost of
sampling each of the strata. In this case, the number of samples
allocated to a stratum varies inversely with the square root of the cost
of sampling that stratum (i.e., strata that are less costly to sample
are allocated a greater number of samples).

448. The mathematical formulation for each of the sample allocations
mentioned is discussed below. Allocation of sampling units for the
estimation of a population can take three factors into account: the
expected size of the strata, the variance of the strata, and the cost of
sampling the strata. The three considerations can be expressed in a
single formula, which may be simplified if particular factors are not to
be considered. The general formula is given by

$$n_i = n \left[ \frac{N_i \sigma_i / \sqrt{c_i}}{\sum_{j=1}^{s} N_j \sigma_j / \sqrt{c_j}} \right]$$

where

$s$ = the number of strata
$i$ = the stratum number (i = 1, 2, ..., s)
$N_i$ = the size of the ith stratum
$\sigma_i$ = the standard deviation (square root of the variance) of the ith stratum
$c_i$ = the cost of sampling the ith stratum
$n$ = the total number of samples to be allocated
$n_i$ = the number of samples allocated to the ith stratum

449. This formula contains all of the elements in deciding optimal
allocation. It allows for consideration of size of the strata, variance,
and cost of sampling each stratum. It is not necessary to include every
factor. If the cost of sampling each stratum is approximately the same,
all $c_i$ factors may be omitted. Also, if the variances are the same or
the population sizes are the same, the factor $\sigma_i$ or $N_i$ may be
omitted, respectively. If the three factors are the same for each
stratum, it is interesting to note that the results of the sample
allocations are the same for the formula above as for a completely
random allocation done without consideration of strata. The three
examples below illustrate the use of this equation under differing
conditions.

450. Assume that the epilimnion, metalimnion, and hypolimnion of a
reservoir are defined as the three strata of interest. Also assume the
strata differ in size (volume) such that the epi:meta:hypo ratio is
6:3:1, cost of sampling and variance are constant across strata, and a
total of 30 samples are to be allocated. Therefore,

$$N_1 = 6, \quad s = 3, \quad n = 30$$
$$N_2 = 3, \quad c_1 = c_2 = c_3 = c$$
$$N_3 = 1, \quad \sigma_1 = \sigma_2 = \sigma_3 = \sigma$$

and the equation becomes

$$n_i = n \left[ \frac{N_i (\sigma / \sqrt{c})}{(N_1 + N_2 + N_3)(\sigma / \sqrt{c})} \right]$$

and can be simplified to

$$n_i = n \left( \frac{N_i}{N_1 + N_2 + N_3} \right)$$

The number of samples allocated to each of the strata would be

$$n_1 = 30 \left( \frac{6}{6 + 3 + 1} \right) = 18$$
$$n_2 = 30 \left( \frac{3}{6 + 3 + 1} \right) = 9$$
$$n_3 = 30 \left( \frac{1}{6 + 3 + 1} \right) = 3$$

The advantage to this sample allocation is that it emphasizes the size
of each stratum. This allocation shifts more samples to strata of
greater size.

451. Given the three strata defined above, the variability within strata
might not be expected to be constant. Variability might be at a minimum
in the well-mixed epilimnion and increase to a maximum in the
hypolimnion. Assume that variability is not constant

$$\sigma_1 = 1, \quad \sigma_2 = 2, \quad \sigma_3 = 4$$

and that all other terms are as given above. The equation for sample
allocation becomes

$$n_i = n \left[ \frac{N_i \sigma_i}{N_1 \sigma_1 + N_2 \sigma_2 + N_3 \sigma_3} \right]$$

where $N_1 \sigma_1 + N_2 \sigma_2 + N_3 \sigma_3 = 6(1) + 3(2) + 1(4) = 16$.
The number of samples allocated to each stratum would be

$$n_1 = 30 \left[ \frac{6(1)}{16} \right] = 11.25 \approx 11$$
$$n_2 = 30 \left[ \frac{3(2)}{16} \right] = 11.25 \approx 11$$
$$n_3 = 30 \left[ \frac{1(4)}{16} \right] = 7.5 \approx 8$$

The result is that more samples are allocated to the more variable
metalimnion and hypolimnion. The advantage to this sample allocation is
that it provides the lowest possible overall variance in the final
estimation of the parameter for all strata combined. The allocation
shifts more samples to strata which are more variable. The final
variance of each stratum is then reduced proportionally to the sample
size allocated to that stratum. The variances of other strata will
increase somewhat if the total sample size is kept constant, but the
variance of the final combined estimate will be minimized.

452. Cost of sampling might also be expected to differ among the strata.
The surface area represented by the epilimnion may be much larger than
that of the hypolimnion and, as a result, the time involved in sampling
(a cost) the epilimnion could be considerably greater than for sampling
the hypolimnion. Assume that cost is not constant

$$c_1 = 2, \quad c_2 = 1.5, \quad c_3 = 1$$

and all other terms are as given above. The equation for sample
allocation becomes

$$n_i = n \left[ \frac{N_i \sigma_i / \sqrt{c_i}}{N_1 \sigma_1 / \sqrt{c_1} + N_2 \sigma_2 / \sqrt{c_2} + N_3 \sigma_3 / \sqrt{c_3}} \right]$$

where

$$N_1 \sigma_1 / \sqrt{c_1} + N_2 \sigma_2 / \sqrt{c_2} + N_3 \sigma_3 / \sqrt{c_3} = 6(1)/\sqrt{2} + 3(2)/\sqrt{1.5} + 1(4)/\sqrt{1} = 13.14$$

The number of samples allocated to each of the strata would be

$$n_1 = 30 \left[ \frac{6(1)/\sqrt{2}}{13.14} \right] = 9.69 \approx 10$$
$$n_2 = 30 \left[ \frac{3(2)/\sqrt{1.5}}{13.14} \right] = 11.18 \approx 11$$
$$n_3 = 30 \left[ \frac{1(4)/\sqrt{1}}{13.14} \right] = 9.13 \approx 9$$

The advantage to this sample allocation is that it provides the lowest
possible overall variance in the final estimation of the parameter for
all strata combined for a given investment. Conversely, for a given
precision, this method could also define a minimum cost. It incorporates
the above considerations but will, in addition, weight the sample
allocation to strata which are less costly to sample.
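
The three worked examples can be reproduced with one small function.
This is a minimal sketch that assumes each stratum receives samples in
proportion to N_i * sigma_i / sqrt(c_i), with any omitted factor treated
as constant across strata; the function and argument names are
illustrative. It returns unrounded allocations, because rounding each
stratum separately does not always sum to n.

```python
import math

def allocate(n, sizes, sds=None, costs=None):
    """Allocate n samples among strata in proportion to N*sd/sqrt(cost)."""
    s = len(sizes)
    sds = sds or [1.0] * s      # omitted: equal variability assumed
    costs = costs or [1.0] * s  # omitted: equal sampling cost assumed
    weights = [N * sd / math.sqrt(c) for N, sd, c in zip(sizes, sds, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

sizes = [6, 3, 1]  # epilimnion : metalimnion : hypolimnion volume ratio

by_size = allocate(30, sizes)
by_size_sd = allocate(30, sizes, sds=[1, 2, 4])
by_size_sd_cost = allocate(30, sizes, sds=[1, 2, 4], costs=[2, 1.5, 1])

for label, result in (("size", by_size),
                      ("size+variability", by_size_sd),
                      ("size+variability+cost", by_size_sd_cost)):
    print(label, [round(x, 2) for x in result])
```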

453. The investigator can obtain either a combined mean or a combined
total population estimate from a stratified sample. The estimate of a
population mean from a stratified random sample is the weighted mean of
the individual strata means, weighted to account for the relative size
of each of the strata (under proportional allocation this is equivalent
to weighting by the number of samples in each stratum). The total
population estimate is the sum of the estimates for each of the
individual strata. The estimate of the variance for the combined strata
is given by the weighted sum of the variances for the individual strata.
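
A minimal sketch of the combined estimates is given below. It assumes
the standard stratified estimators: the combined mean is the sum of
W_i * mean_i, and the variance of that mean is the sum of
W_i^2 * s_i^2 / n_i, where W_i is the fraction of the population in
stratum i and the finite population correction is ignored. The stratum
means, variances, and sample sizes used are hypothetical.

```python
def stratified_mean(weights, means):
    """Combined mean: stratum means weighted by relative stratum size."""
    return sum(w * m for w, m in zip(weights, means))

def stratified_mean_variance(weights, variances, ns):
    """Variance of the combined mean (no finite population correction)."""
    return sum(w ** 2 * v / n for w, v, n in zip(weights, variances, ns))

weights = [0.6, 0.3, 0.1]       # relative stratum sizes (6:3:1)
means = [20.0, 30.0, 50.0]      # hypothetical stratum means
variances = [1.0, 4.0, 16.0]    # hypothetical stratum variances
ns = [11, 11, 8]                # samples taken in each stratum

combined_mean = stratified_mean(weights, means)
combined_var = stratified_mean_variance(weights, variances, ns)
print(round(combined_mean, 2), round(combined_var, 4))
```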

Determining  the  Number  of  Samples 

454. It is possible to estimate how large a sample size is required to
achieve a particular level of precision. The objective is to estimate
some parameter to within a specified tolerance. The final statement of
the results may be expressed as a confidence interval. This statement is
expressed in the form

$$P(\text{lower limit} \le \text{true population parameter} \le \text{upper limit}) = \text{confidence level}$$

455. There is always a chance that the interval does not contain the
true value of the population parameter. For this reason the confidence
level, expressed as a probability, will always be less than one, or 100
percent. For instance, the investigator may be 95 percent sure that the
interval given contains the true population value. The width of the
interval reflects both the level of confidence and the variability in
the population.

456. A more complete discussion of confidence intervals is presented in
the section on statistical analysis. For the moment it is only necessary
that the reader understand that the objective of calculating a "required
minimum sample size" is to obtain a sample size from which a confidence
interval can be calculated which will contain the true population
parameter within a specified level of tolerance.

457. The formula for calculating sample size requirements is
simple, but obtaining good estimates of the factors contained in the
formula is often difficult. The formula is given by

$$ n \geq \frac{t^2 \sigma^2}{p^2} $$

where n is the sample size needed for a particular probability level
of error (or confidence) in the eventual confidence interval
calculation. The value of t is given by this probability. The variance
for the population is given by $\sigma^2$, and the desired level of
precision or tolerance is given by p. The variance factor ($\sigma^2$)
refers to variability in the population. It dictates that, for a given
level of confidence and precision, more variable populations require
larger samples (larger values of n). The factor p is determined by the
researcher and gives the level of precision required. For example, if
the objective is to estimate phosphorus concentration, the researcher
may decide that the estimate should be within 2 µg P/L of the actual
concentration. That is, the estimate will be the actual concentration
±2 µg P/L. The values of t, p, and $\sigma^2$ used must be provided by
the investigator. Many who are not familiar with the process, or with
the population to be sampled, may be at a loss to give reasonable
estimates of these values.
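
The calculation can be sketched in a few lines of Python. This is only an illustration, reading the formula as n ≥ t²σ²/p² and rounding up to a whole sample; the function name and the input values (a variance of 25 and a tolerance of 2 µg P/L) are assumptions chosen for the example, not values from this report.

```python
import math


def required_sample_size(t: float, variance: float, tolerance: float) -> int:
    """Smallest whole n with n >= t^2 * variance / tolerance^2."""
    if variance < 0 or tolerance <= 0:
        raise ValueError("variance must be >= 0 and tolerance must be > 0")
    return math.ceil(t ** 2 * variance / tolerance ** 2)


# Hypothetical inputs: t = 2 (roughly a 5-percent error level), an assumed
# variance of 25 (ug P/L)^2, and a tolerance of 2 ug P/L.
n_5_percent = required_sample_size(t=2.0, variance=25.0, tolerance=2.0)

# Same population and tolerance at roughly a 1-percent error level (t = 2.6).
n_1_percent = required_sample_size(t=2.6, variance=25.0, tolerance=2.0)

print(n_5_percent, n_1_percent)
```

With these assumed inputs the requirement rises from 25 to 43 samples when the error level is tightened, and halving the tolerance would multiply either figure by four.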

458. The value t is derived from statistical theory. The
researcher must first decide on an acceptable probability of error in
the eventual results of the study. There will always be some
probability of error. For example, when the researcher finally states
that the actual concentration is 20 ± 2 µg P/L, there is always a
possibility that he will be wrong. If he sets t to a level which
corresponds to a 5-percent error rate, he can expect to be wrong about
five times in 100 estimates. If he chooses a 1-percent error level, he
can expect to be wrong only about one time in 100 estimates, but he
will have to take many more samples. Generally, researchers have chosen
a 5-percent or 1-percent level of error. The t values corresponding to
these levels may be initially approximated as 2 and 2.6, respectively.
Actual values of t can be obtained from statistical tables when the
expected sample size has been estimated.
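
The starting values of 2 and 2.6 can be compared with the large-sample limits of t, which are the two-sided quantiles of the standard normal distribution. The sketch below uses only the Python standard library; the function name is an assumption made for illustration. Tabled t values for small samples are somewhat larger than these limits, so the result is a first approximation only.

```python
from statistics import NormalDist


def two_sided_normal_quantile(error_rate: float) -> float:
    """Normal quantile leaving error_rate / 2 in each tail."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - error_rate / 2.0)


for error_rate in (0.05, 0.01):
    # Prints the large-sample value that t approaches as n grows.
    print(error_rate, round(two_sided_normal_quantile(error_rate), 3))
```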

459. The estimate of the variance $\sigma^2$ should be determined, if
possible, from the population to be sampled. This would require a
preliminary study of the population and an estimate of the variance.
Variance is estimated by the formula

$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$

where

$X_i$ = value of the $i$th observation of the target population
variable
$s^2$ = estimate of $\sigma^2$, indicated as $\hat{\sigma}^2$
n = sample size

If  the  estimate  cannot  be  obtained  with  a  preliminary  sample,  an  approx¬ 
imation  may  be  obtained  from  published  values  from  similar  populations. 
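
As an illustration, the variance estimate from a preliminary sample can be computed directly from its definition (sum of squared deviations from the sample mean, divided by n - 1). The pilot values below are hypothetical, and the function name is chosen only for this sketch.

```python
from statistics import mean, variance


def sample_variance(observations):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    x_bar = mean(observations)
    return sum((x - x_bar) ** 2 for x in observations) / (n - 1)


# Hypothetical pilot-study phosphorus concentrations, ug P/L.
pilot = [18.0, 22.0, 20.0, 25.0, 15.0]
s2 = sample_variance(pilot)

# The standard-library routine uses the same n - 1 divisor.
print(s2, variance(pilot))
```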

460. The desired precision is one of the more difficult values to
predetermine. If the investigator knows that he would like to determine
the final estimate within ±2 µg P/L, this value would be used. The
desired precision can also be expressed as a percent of the mean. For
example, if the concentration is expected to be about 20 µg P/L, the
precision might be requested as 10 percent of this value, or 2 µg P/L.
A level of precision of this magnitude is often appropriate for field
studies. More refined laboratory studies may require greater precision,
while preliminary surveys or short studies whose objective is to guide
an administrative decision may require less precision.

461. The investigator should understand that the values provided
are only rough estimates. It is possible that when the final
calculations are made, the tolerance is not as small as originally
specified. A similar development of the same minimum sample size
estimating formula can be made for tests of hypothesis. Again, the
application of the formula does not guarantee that differences can be
detected at the level originally specified. Still, the formula provides
the best estimates available.


Systematic  Sampling 

462.  All  of  the  sampling  discussed  previously  was  random  or 
stratified  random.  Systematic  sampling  is  generally  easier  to  carry  out 
than  random  sampling.  In  systematic  sampling  the  first  sample  placement 
is generally decided at random within an initial region, and subsequent
samples  are  taken  at  some  constant  distance  or  time  from  the  first.  For 
example,  the  first  sample  may  be  randomly  placed  near  the  headwaters  of 
a  reservoir,  and  subsequent  samples  are  taken  every  2  miles  down  the 
reservoir  from  the  previous  sample.  More  commonly,  sequential  samples 
are  taken  over  time.  The  first  sample  is  taken  in  January,  and  addi¬ 
tional  samples  are  taken  at  4-week  intervals  after  the  first.  Although 
this  is  a  popular  sampling  scheme,  there  are  several  subtle  problems 
that  can  originate  from  its  use.  The  eventual  statistical  analysis  of 
data will generally require an assumption of random sampling. A
systematic sample may have a variance that is either greater or smaller
than that of a random sample.
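
The placement rule described above (a random start within an initial interval, then a constant step) can be sketched as follows. The reservoir length of 20 miles, the function name, and the seed are assumptions made only for illustration.

```python
import random


def systematic_positions(length: float, interval: float, rng: random.Random) -> list[float]:
    """Random start in [0, interval), then one sample every `interval` units."""
    if length <= 0 or interval <= 0:
        raise ValueError("length and interval must be positive")
    start = rng.random() * interval
    positions = []
    k = 0
    while start + k * interval < length:
        positions.append(start + k * interval)
        k += 1
    return positions


rng = random.Random(1986)  # fixed seed so the sketch is repeatable
# Hypothetical reservoir 20 miles long, sampled every 2 miles.
stations = systematic_positions(length=20.0, interval=2.0, rng=rng)
print([round(s, 2) for s in stations])
```

Only the first station is random; every later station is fixed once the start is drawn, which is why the usual random-sampling assumption does not hold for the set as a whole.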

463. Systematic sampling in an area, such as over the surface of
a reservoir, often entails sampling a grid of stations. This sampling
scheme is effective in covering the whole range of variability
available in the area, since it will often uncover heterogeneities
missed in random sampling. Because systematic sampling covers the range
of variability so well, the resulting variance can be greater than if
sampling is done randomly. On the other hand, other studies have shown
that variances resulting from systematic samples are frequently smaller
than those resulting from random sampling. Whether the variance
estimate is large or small depends on the distribution of the
population being sampled.

464.  In  either  case,  the  assumption  of  random  sampling  required 
for  the  eventual  statistical  analysis  will  not  be  met.  This  is  particu¬ 
larly  important  in  hypothesis  testing,  and  it  will  also  affect  confi¬ 
dence  interval  estimates.  However,  this  does  not  mean  that  estimates 
obtained  from  systematic  samples  are  necessarily  biased  or  lack 
precision. 

465. The detection of the wider range of variability may be
desirable in preliminary studies, particularly if the objectives include
the definition of strata, an application for which systematic sampling
may be the better choice. However, once the strata are defined, the
sampling should be done at random within the strata, if possible, in
order to meet the assumptions of the statistical analysis.


Cluster  Sampling 

466. Another type of sampling design is cluster sampling, done as
either one- or two-stage sampling. This type of sampling is applicable
when the sampling unit is an identifiable group of individuals or a
cluster of observations. In this case it may be impractical to sample
all of the individuals at random, so the clusters are identified and a
sample of clusters is selected at random. Each selected cluster is then
completely sampled (one-stage) or subsampled (two-stage).

467. As an example, suppose that the nitrogen concentration of
first-order streams is to be sampled in a reservoir drainage basin.
There may be a hundred small first-order streams in the basin.
Compiling a list of all these streams for a random sample would be
difficult, and covering the whole drainage basin to provide a random
sample may not be practical. An alternative would be to identify
individual river basins entering the reservoir, which may only number a
dozen, and to select first-order streams in two stages. First, a random
sample of river basins is selected; second, a random sample of
first-order streams is selected within each of the river basins
selected for sampling. This greatly simplifies the listing of feeder
streams for sampling, since maps for only a sample of the river basins
must be searched, rather than for all of the river basins. Field
sampling is also simplified, since trips need not be made into all of
the river basins in the reservoir drainage area.
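
The two-stage selection in this example can be sketched as below. The basin and stream names, the numbers selected at each stage, and the function name are all hypothetical; a real program would list streams from maps only for the basins drawn in the first stage.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: random clusters. Stage 2: random units within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


# Hypothetical drainage area: 12 river basins, each with 8 first-order streams.
basins = {
    f"basin_{b:02d}": [f"basin_{b:02d}/stream_{s}" for s in range(1, 9)]
    for b in range(1, 13)
}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
plan = two_stage_sample(basins, n_clusters=4, n_per_cluster=3, rng=rng)
for basin, streams in plan.items():
    print(basin, streams)
```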

468. The analysis of cluster samples requires the estimation of
variance at two levels, the between-cluster variability and the
within-cluster variability. The total variability is a combination of
these two levels. The slightly more complicated calculation of the
combined variance may be more than offset by the practicality of this
sampling scheme and by savings in the cost of sampling.
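
The way the two levels combine can be illustrated with the standard sum-of-squares identity: the total sum of squares equals the between-cluster part plus the within-cluster part. The sketch below shows only this partition, with hypothetical nitrogen values; it is not the full variance estimator for a two-stage survey, which this section does not give.

```python
from statistics import mean


def sum_of_squares_partition(clusters):
    """Return (between, within, total) sums of squares for grouped data."""
    all_values = [x for values in clusters.values() for x in values]
    grand_mean = mean(all_values)
    between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in clusters.values()
    )
    within = sum(
        (x - mean(values)) ** 2
        for values in clusters.values()
        for x in values
    )
    total = sum((x - grand_mean) ** 2 for x in all_values)
    return between, within, total


# Hypothetical nitrogen concentrations from streams in two sampled basins.
data = {"basin_a": [1.0, 2.0, 3.0], "basin_b": [4.0, 5.0, 6.0]}
between, within, total = sum_of_squares_partition(data)
print(between, within, total)
```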

Types  of  Sampling  Programs 

469.  The  basic  types  of  water  quality  investigations  may  be 
placed  in  several  categories  depending  on  the  objectives  of  the  sampling 
program.  The  objectives  may  call  for  parameter  estimation,  a  test  of  a 
hypothesis,  or  the  development  of  a  predictive  model.  The  objectives 
are  not  mutually  exclusive  since  hypotheses  about  the  estimated  parame¬ 
ters  may  be  tested.  However,  the  primary  goal  should  be  stated  in  the 
objectives. 

470.  Data  collected  in  any  sampling  program  should  ultimately  be 
processed  by  some  statistical  technique.  Therefore,  some  important 
aspects  of  statistical  analysis,  such  as  the  hypotheses  to  be  tested  and 
which sources of variability should be included in the analysis, are
discussed  as  part  of  the  considerations  of  sampling. 

Parameter  estimation 

471.  The  primary  objective  in  many  programs  is  to  document  the 
status  of  an  area.  This  is  often  the  objective  of  baseline  or  pilot 
studies.  These  studies  are  designed  to  establish  the  normal  levels  of 
parameters  prior  to  impoundment  of  an  area  or  some  other  future  change. 
The eventual analysis resulting from this type of study would generally
be  some  parameter  estimation.  The  variables  measured  would  depend  on 
the  objectives  of  the  study.  The  objective  may  be  to  document  particu¬ 
lar  attributes  of  water  quality,  such  as  pH,  dissolved  oxygen,  turbid¬ 
ity,  or  alkalinity.  In  this  case  the  analysis  may  include  only  the  mean 
values  and  a  measure  of  variability. 

472.  Another  type  of  survey  is  the  pilot  study  of  an  area,  prior 
to  the  initiation  of  a  major  study.  Pilot  studies  may  have  as  their 
objective  the  estimation  of  parameters  and  the  determination  of  their 
range  and  variance.  Another  objective  of  pilot  studies  may  be  the 
identification  of  strata.  Stratification,  previously  discussed,  is 
simply  the  subdivision  of  an  area  into  smaller,  more  uniform  sections. 
This  is  an  important  aspect  of  reducing  the  variability  of  the  final 
estimates  produced  by  a  study.  Estimates  of  the  costs  of  sampling  may 
also  be  obtained  from  pilot  studies. 

Tests  of  hypotheses 

473.  Another  type  of  study  objective  may  be  the  testing  of  some 
specific  hypothesis.  There  are  two  basic  approaches  in  this  type  of 
study.  The  objective  may  be  to  test  the  existing  water  quality  against 
some  hypothesized  value,  or  it  may  be  to  test  the  equality  of  two  or 
more  areas,  seasons,  or  reservoirs.  Although  this  type  of  study  may  be 
done  as  a  survey,  many  aspects  of  designed  experiments  may  be  used  to 
improve  the  results. 

474. A test of hypothesis may be formally represented by the
mathematical statement

$$ H_0: \mu \geq \theta $$

where

$\mu$ = mean of the sampled population
$\theta$ = hypothesized value

The alternative hypothesis can be stated as

$$ H_A: \mu < \theta $$

475. Note that this set of hypotheses is one-sided, that is, the
only  alternative  hypothesis  of  interest  is  one  where  the  parameter  esti¬ 
mates  are  below  the  hypothesized  value.  The  hypothesis  may  also  be  two- 
sided  as  in  the  example  below. 

476.  Tests  of  hypothesis  are  not  always  made  against  hypothesized 
values.  The  objective  may  be  to  test  for  a  difference  of  the  means 
between  two  areas.  Actually  there  is  an  hypothesized  value  here,  even 
though  it  is  not  known  in  advance  what  mean  values  the  reservoirs  will 
have.  The  hypothesis  in  this  case  is  that  the  difference  between  the 
mean  values  observed  for  the  two  reservoirs  will  be  zero.  The  alterna¬ 
tive  is  that  the  difference  is  not  zero,  indicating  that  a  difference 
exists. 

477. These hypotheses may be represented mathematically as

$$ H_0: \mu_1 = \mu_2 $$

or

$$ H_0: \mu_1 - \mu_2 = 0 $$

where

$\mu_1$ = mean of the first population sampled
$\mu_2$ = mean of the second population

478. In statistically testing a hypothesis, the primary hypothesis
($H_0$) is called the "null hypothesis" because it is generally a
hypothesis of no difference. The alternative hypotheses are

$$ H_A: \mu_1 \neq \mu_2 $$

or

$$ H_A: \mu_1 - \mu_2 \neq 0 $$


479.  In  this  case  the  alternative  hypothesis  is  two-sided  since 
the  second  area  may  have  either  a  larger  value  than  the  first,  or  a 
smaller  value  than  the  first.  Either  case  is  of  interest,  and  both 
indicate  that  the  areas  are  not  equal.  The  hypothesis  in  this  type  of 
example  may  also  be  one-sided.  For  example,  the  hypothesis  may  be  that 
one  area  has  a  higher  dissolved  oxygen  concentration  than  some  other 
area. 
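
One common way to test the hypothesis of no difference between two area means is the pooled two-sample t statistic, sketched below. This assumes random samples and equal variances in the two areas; the dissolved oxygen values and the function name are hypothetical. The statistic would be compared with a tabled t value for the stated degrees of freedom, using both tails for a two-sided alternative or one tail for a one-sided alternative.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_1, sample_2):
    """Return (t, degrees of freedom) for H0: mu_1 - mu_2 = 0."""
    n1, n2 = len(sample_1), len(sample_2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)) / df
    standard_error = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(sample_1) - mean(sample_2)) / standard_error, df


# Hypothetical dissolved oxygen concentrations (mg/L) from two areas.
area_1 = [7.9, 8.3, 8.1, 7.7]
area_2 = [6.8, 7.1, 7.4, 6.7]
t_value, df = pooled_t_statistic(area_1, area_2)
print(round(t_value, 3), df)
```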

480.  Similar  tests  of  hypothesis  may  be  made  to  test  for  effects, 
that  is,  changes  due  to  some  treatment  that  has  been  administered  by  the 
investigator or has occurred naturally. Here again, the hypothesis is
one  of  "no  difference  exists"  against  the  alternative  that  "a  difference 
does  exist." 

Prediction 

481.  Predicting  the  value  of  a  variable  under  certain  conditions 
is  done  with  regression  analysis.  In  regression  analysis  the  objective 
is  generally  to  demonstrate  or  document  a  relationship  between  one 
variable  called  the  "dependent"  variable  and  one  or  more  variables 
called  the  "independent"  variable  or  variables. 

482.  A  major  difference  between  this  type  of  analysis  and  those 
discussed  previously  is  that,  in  parameter  estimation  and  hypothesis 
testing,  the  means  being  estimated  or  tested  could  have  been  for  some 
particular  group,  class,  or  category,  such  as  an  area  or  a  particular 
month.  Regression  analysis  requires  the  use  of  a  quantitative  variable. 
At  least  one  of  the  independent  variables  in  a  regression  analysis  will 
be  a  quantitative  variable,  as  opposed  to  a  qualitative,  categorical,  or 
group  variable. 

483.  Examples  of  relationships  that  may  be  documented  by  regres¬ 
sion  analysis  are  phosphorus-chlorophyll  relationships  in  reservoirs  or 
the  relationship  between  light  and  primary  productivity.  Regression 
analysis  can  also  be  used  to  test  for  relationships  between  some  vari¬ 
able  and  a  gradient,  or  to  test  for  trends  over  time.  Predictions  done 
with  regression  analysis  may  result  either  from  surveys  or  from  designed 
experiments. 
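
A minimal sketch of such a prediction is an ordinary least-squares line fitted to paired observations. The phosphorus and chlorophyll values below are invented for illustration, and a straight line is assumed only to keep the example short; real relationships of this kind are often fitted after transformation.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical paired observations: total phosphorus (independent variable)
# and chlorophyll (dependent variable), both in ug/L.
phosphorus = [10.0, 20.0, 30.0, 40.0, 50.0]
chlorophyll = [3.1, 6.4, 8.8, 12.5, 14.9]

slope, intercept = fit_line(phosphorus, chlorophyll)
predicted_at_35 = intercept + slope * 35.0
print(round(slope, 3), round(intercept, 3), round(predicted_at_35, 2))
```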

Sampling  a  spatial  gradient 

484.  Analysis  of  a  gradient  such  as  a  trend  with  depth  or  dis¬ 
tance  calls  for  the  measurement  of  the  variable  of  interest  and  for 
measurement  of  the  gradient  variable.  In  terms  of  material  already 
discussed,  the  gradient  variable  will  be  treated  as  a  covariable  to  the 
variable  of  interest.  As  usual,  the  hypothesis  is  that  the  gradient 
variable  will  account  for  some  additional  variability  of  the  variable  of 
interest. 

485.  The  relationship  between  a  gradient  and  the  response 
variable  may  be  either  a  simple  linear  function  or  some  nonlinear 
relationship. If the response of the variable of interest to the
covariable is known to be linear, or can be assumed to be linear, the
only sampling observations required are those from the two ends of the
gradient.
If  the  gradient  is  not  a  simple  linear  function,  or  if  one  of  the  objec¬ 
tives  is  to  test  the  relationship  for  linearity,  the  gradient  should  be 
sampled  at  a  number  of  intermediate  points  along  its  range. 

486.  Two  aspects  of  analyzing  a  gradient  should  be  taken  into 
account.  First,  the  analysis  of  a  response  along  a  gradient  that  is  not 
linear  requires  placing  samples  along  the  gradient.  The  tendency  of 
many  investigators  is  to  then  spread  all  of  the  sample  points  out  along 
the  gradient.  The  second  aspect  of  analyzing  a  gradient  is  the  evalua¬ 
tion  of  the  adequacy  of  the  model  fitted  to  the  gradient.  This  will 
require  replication  of  the  samples  at  numerous  points  along  the 
gradient.

487.  The  investigator  is  then  faced  with  a  decision.  Is  it  pref¬ 
erable  to  sample  more  points  along  the  gradient  with  few  replicates  at 
each  point,  or  to  sample  fewer  points  with  more  replicates?  As  stated 
previously,  if  the  relationship  is  linear,  then  only  two  points  need  be 
sampled (sampling the extremes is preferable). Whenever the
relationship is known, even if it is curved, then relatively few
sampling points
are  needed  along  the  gradient.  The  minimum  number  of  points  depends  on 
the  relationship  that  is  to  be  used.  In  this  case,  more  samples  may  be 
used  as  replicates. 

488.  If  little  is  known  about  the  relationship,  or  if  the  rela¬ 
tionship  is  complex,  the  necessary  number  of  points  along  the  gradient 
increases.  At  the  same  time,  more  replicates  are  required  in  order  to 
test  the  adequacy  of  the  proposed  model.  It  is  better  in  this  case  to 
spread  as  many  points  as  possible  along  the  gradient,  but  to  insist  on 
some  replication  at  the  sampling  points. 
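
The trade-off between points and replicates can be made concrete with the usual partition of the residual sum of squares from a straight-line fit into pure error (from replicates) and lack of fit (from intermediate points). The sketch below is one standard way to set this up, not a procedure taken from this report; the data and function names are hypothetical, and replicates are identified by exactly equal gradient values.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def lack_of_fit_partition(x, y):
    """Split straight-line residual variation into pure error and lack of fit."""
    slope, intercept = fit_line(x, y)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)  # replicates share the same gradient value
    pure_error = 0.0
    lack_of_fit = 0.0
    for xi, ys in groups.items():
        group_mean = sum(ys) / len(ys)
        pure_error += sum((yi - group_mean) ** 2 for yi in ys)
        lack_of_fit += len(ys) * (group_mean - (intercept + slope * xi)) ** 2
    return {
        "pure_error_ss": pure_error,
        "pure_error_df": len(x) - len(groups),
        "lack_of_fit_ss": lack_of_fit,
        "lack_of_fit_df": len(groups) - 2,
    }


# Hypothetical design A: three depths, two replicates each.
three_points = lack_of_fit_partition([0, 0, 1, 1, 2, 2], [0.0, 2.0, 3.0, 5.0, 2.0, 4.0])
# Hypothetical design B: only the two extremes, two replicates each.
two_points = lack_of_fit_partition([0, 0, 10, 10], [1.0, 3.0, 6.0, 8.0])
print(three_points)
print(two_points)
```

With only the two extremes sampled, the lack-of-fit degrees of freedom are zero, so linearity cannot be tested; without replicates, the pure-error degrees of freedom are zero, so there is nothing to test it against.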

489.  It  is  generally  good  practice  for  the  placement  of  the 
points  along  the  gradient  to  be  approximately  equidistant,  but  care 
should  be  taken  to  randomly  select  points  within  gradient  segments  if 
there is a danger of falling in step with some natural phenomenon.
However, equidistant points are not necessarily the optimal
distribution if, for instance, a particular section of the gradient is
to be estimated with greater precision.

Sampling  over  time 


490.  When  samples  are  taken  over  time,  time  can  be  considered  as 
an  additional  variable  or  source  of  variability.  Time  can  either  be 
viewed  as  a  source  of  variability  that  should  be  blocked  into  more  homo¬ 
geneous  subunits  (quarters,  seasons,  or  months)  or  as  a  covariable.  If 
the  changes  in  the  variable  of  interest  are  expected  to  be  small,  or 
stable  for  long  periods  with  short  periods  of  change,  then  blocking  on 
time  may  be  desirable.  The  periods  to  be  blocked  would  depend  on  which 
periods  of  time  are  expected  to  be  stable.  If  treated  as  a  covariable, 
the  preceding  discussion  on  sampling  a  gradient  applies  quite  well  to 
sampling  over  time.  However,  two  considerations  should  be  included. 

491.  First,  "time"  is  not  likely  to  be  the  actual  influencing 
variable.  Changes  over  the  annual  cycle,  for  instance,  occur  because  of 
changing  light  intensity,  temperature,  etc.  "Time"  can  only  be  used  as 
a surrogate for other variables that are either unknown or whose
individual influences on the target variable cannot be discerned.

492.  The  second  consideration  is  that  the  relationship  with  time 
as a covariable is likely to be complex. Over an annual cycle the
relationship with most variables is unlikely to be linear (though
relationships may appear linear for short periods of time). Sampling
over the
annual  cycle,  with  time  as  a  covariable,  will  therefore  call  for  rela¬ 
tively  frequent  samples  over  time,  but  with  some  replication  within 
sampling  periods.  Care  should  also  be  taken  to  sample  randomly  within 
each  of  the  sampling  periods. 

Frequency  of  sampling 

493.  The  frequency  of  samples  over  time  depends  on  the  definition 
of  the  target  population  and  whether  time  is  to  be  considered  as  strata 
or  a  covariable.  If  stratified,  the  samples  within  the  time  strata 
should  be  randomly  allocated,  and  their  number  would  reflect  the  desired 
precision.  If  time  is  to  be  treated  as  a  covariable,  the  annual  cycle 
should be subdivided into a larger number of subunits, and sample
replicates randomly placed within the subunits. Monthly sampling is
usually adequate to detect the annual pattern of changes with time. If
the
investigation  requires  the  detection  of  short-lived  phenomena,  then  more 
frequent  sampling  may  be  required  to  obtain  greater  resolution. 
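
Random placement of replicates within monthly subunits can be sketched as below. The year, the number of replicates per month, and the function name are assumptions made for illustration; a study of short-lived phenomena would use shorter subunits in the same way.

```python
import calendar
import random


def random_days_by_month(year: int, replicates: int, rng: random.Random) -> dict[int, list[int]]:
    """Draw `replicates` distinct sampling days at random within each month."""
    schedule = {}
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        days = rng.sample(range(1, days_in_month + 1), replicates)
        schedule[month] = sorted(days)
    return schedule


rng = random.Random(12)  # fixed seed so the sketch is repeatable
schedule = random_days_by_month(year=1986, replicates=2, rng=rng)
for month, days in schedule.items():
    print(calendar.month_abbr[month], days)
```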

Summary 

494.  There  are  several  aspects  of  sampling  with  which  the 
researcher  should  be  familiar  before  designing  a  sampling  program.  The 
design  used  will  depend  on  the  objectives  and  on  the  types  of  variables 
that  must  be  taken  into  consideration.  In  any  design,  the  primary 
concern  will  be  variance  or  variability.  Whether  the  objective  is 
parameter  estimation,  testing  of  a  hypothesis,  or  development  of  a  pre¬ 
dictive  model,  the  results  should  have  the  smallest  possible  component 
of  unexplained  variability.  The  main  concern  is  to  reduce,  eliminate, 
or  account  for  sources  of  variability. 

495. The sampling plan can reduce variability in several ways.
One way is to standardize the field sampling techniques as much as
possible. Any refinement of technique that will contribute to
uniformity will aid in reducing variability.

496.  Not  all  variability  can  be  reduced  or  eliminated.  In  some 
cases  the  sampling  program  will  have  to  include  the  measurement  of  exo¬ 
genous  variables  (variables  that  are  not  part  of  the  study  but  which 
contribute  to  the  variability)  in  order  to  measure  and  account  for 
additional variability. In other cases, the parameter estimates cannot
be done separately for each area. For instance, when the estimates are
used for tests of hypotheses, it is convenient to combine all areas
into one analysis in order to obtain a pooled estimate of variability.
Blocking provides a measure of variability without "eliminating" it and
allows for all data to be combined into a single analysis for testing
hypotheses.

497.  Previous  mention  has  been  made  of  stratification  as  a  method 
of  eliminating  variability.  Although  there  is  no  real  difference  in  the 
concept  of  a  "block"  or  a  "stratum,"  there  is  a  difference  in  the  way  in 
which  the  two  are  applied.  A  block  may  be  a  subdivision  of  a  larger 
population,  such  that  the  variability  within  the  individual  block  is 
less than for the whole, which is also true of strata. Blocking is a
term generally applied to situations in which hypotheses are to be
tested, and it is simply a convenience used to account for a fraction
of the variability (the between-block variability). The variance within
the blocks is assumed to be the same, and a pooled estimate of the
variance is obtained for the best test of the hypothesis. If the
variances are not the same, efforts must be made (such as
transformation) to ensure their uniformity.
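
A pooled within-block variance weights each block's variance by its degrees of freedom. The sketch below assumes, as the text does, that the within-block variances are equal; the block names and values are hypothetical.

```python
from statistics import variance


def pooled_within_block_variance(blocks):
    """Weighted mean of block variances, with weights n_i - 1."""
    if any(len(values) < 2 for values in blocks.values()):
        raise ValueError("each block needs at least two observations")
    numerator = sum((len(values) - 1) * variance(values) for values in blocks.values())
    degrees_of_freedom = sum(len(values) - 1 for values in blocks.values())
    return numerator / degrees_of_freedom


# Hypothetical observations grouped by season (the blocks).
by_season = {
    "spring": [1.0, 2.0, 3.0],
    "summer": [10.0, 12.0, 14.0],
}
print(pooled_within_block_variance(by_season))
```

The large difference between the two block means does not enter the pooled value; it is accounted for separately as between-block variability.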

498.  Other  exogenous  variables  will  also  explain  or  remove  a 
certain  part  of  the  variability.  For  a  previous  example  of  primary 
productivity, it was as important to know the light conditions as it was
to  know  the  season.  In  fact,  the  two  variables,  season  and  light,  may 
be  explaining  the  same  variability  in  productivity.  That  is,  the  season 
variability  may  be  due  to  changing  light,  and  some  of  the  light  varia¬ 
bility  is  due  to  changing  seasons.  On  the  other  hand,  season  may 
account  for  some  variability  which  is  not  explained  by  light,  such  as 
changes  in  nutrient  availability  and  species  composition,  so  that  light 
may  not  be  able  to  completely  explain  the  seasonal  variability.  In  this 
example  the  light  is  a  quantitative  variable  called  a  "covariable"  and 
season  is  a  class  variable,  where  each  season  is  a  "block"  or  category 
of  the  class  variable. 

APPENDIX A: GLOSSARY


accuracy — the  nearness  of  a  measurement  to  the  actual  value  of  the 
variable  measured. 

alpha  error — see  Type  I  error. 

alternative  hypothesis — the  hypothesis  that  remains  tenable  when 
the  null  hypothesis  is  rejected. 

beta  error — see  Type  II  error. 

central  tendency — the  tendency  for  a  majority  of  measurements  to 
lie  near  the  middle  of  the  range  of  the  entire  set  of  measurements. 

descriptive statistics — a means to organize and summarize data.

frequency  distribution — the  distribution  of  the  total  number  of 
observations  among  a  set  of  categories. 

heterogeneity  of  variance — variances  of  different  samples  are  not 
equal; also referred to as heteroscedasticity.

homogeneity  of  variance — variances  of  different  samples  are  equal; 
also referred to as homoscedasticity.

inferential  statistics — a  means  to  make  generalized  conclusions, 
based  on  the  inference  of  characteristics  of  the  population  drawn  from 
the  characteristics  of  the  sample. 

level  of  significance — the  probability  level  that  is  considered  to 
be  too  low  to  justify  support  of  the  hypothesis  being  tested. 

Model  I  ANOVA — an  analysis  of  variance  involving  fixed  effects, 
i.e.,  the  levels  of  a  factor  are  specifically  chosen. 

Model  II  ANOVA — an  analysis  of  variance  involving  random  effects, 
i.e.,  the  levels  of  a  factor  are  chosen  at  random. 

Model  III  ANOVA — a  factorial  analysis  of  variance  that  includes 
both  fixed  and  random  effects;  also  referred  to  as  a  mixed  model. 

nonparametric  statistics — statistical  methods  that  draw  inferences 
about  populations  but  not  their  parameters;  since  these  methods  do  not 
make  assumptions  about  the  distribution  of  the  sampled  populations,  they 
are  also  referred  to  as  distribution-free  statistics. 

null  hypothesis — statement  of  "no  difference"  (see  Alternative 
Hypothesis).

parameter — a  quantity  characteristic  of  a  population  (see 
Statistic).

parametric  statistics — statistical  methods  that  make  inferences 
about  a  population's  parameters;  these  methods  generally  assume  random 
sampling,  a  normal  distribution,  and  homogeneity  of  variance. 


population — the  entire  collection  of  measurements  about  which  one 
wishes  to  draw  conclusions;  sometimes  referred  to  as  the  universe  or 
target  population. 

power — the  probability  of  rejecting  the  null  hypothesis  when  it  is 
false  and  should  be  rejected. 

precision — refers  to  the  closeness  of  repeated  measures  of  the 
same  quantity. 

random  sample — a  sample  in  which  each  member  of  the  population  had 
an equal and independent chance of being sampled.

robust — refers to how sensitive the validity of a given statistical
test is to minor deviations from the assumptions of the test.

sample — a  subset  of  all  possible  measurements  of  the  population. 

skewed  distribution — a  frequency  distribution  in  which  the  mean 
and  median  are  not  identical. 

statistic —  a  quantity  estimated  from  sample  data;  an  estimate  of 
a  population  parameter. 

statistical  hypothesis — a  statement  about  a  statistical  population 
which  one  seeks  to  accept  or  reject  on  the  basis  of  observed  data. 

statistical  test — a  set  of  rules  by  which  the  decision  about  a 
statistical  hypothesis  is  made. 

symmetrical  distribution — a  frequency  distribution  in  which  the 
mean  and  median  are  identical. 


target population — the statistical population about which inferences
are to be made based on sample data (see Population).

transformation — a  mathematical  operation  applied  to  sample  data  to 
correct  for  a  nonnormal  distribution  and  heterogeneity  of  variance. 

Type  I  error — the  rejection  of  the  null  hypothesis  when  it  is  in 
fact  true;  also  called  an  alpha  error. 

Type  II  error — the  acceptance  of  the  null  hypothesis  when  it  is  in 
fact  false;  also  called  a  beta  error. 

variable — a  characteristic  that  varies  from  one  entity  to  another. 


